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ABSTRACT OF THE DISSERTATION 


On Recursive Least-Squares 
Filtering Algorithms and Implementations 

by 

Shih-Fu Hsieh 

Doctor of Philosophy in Electrical Engineering 
University of California. Los Angeles, August 1990 
Professor Kung Yao, Chair 

For many real-time signal processing applications, fast and numerically sta- 
ble algorithms for solving least-squares problems are necessary and important. 
In particular, under non-stationary conditions, these algorithms must be able 
to adapt themselves to reflect the changes in the system and take appropri- 
ate adjustments to achieve optimum performances. Among existing algorithms, 
the QR-decomposition (QRD)-based recursive least-squares (RLS) methods have 
been shown to be useful and effective for adaptive signal processing in modern 
communications, radar, and sonar systems. 

In order to increase the speed of processing and achieve high throughput rate, 
many algorithms are being vectorized and/or pipelined to facilitate high degrees 
of parallelism. A time-recursive formulation of RLS filtering employing block 
QRD will be considered first. Several methods including a new non-continuous 
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windowing scheme based on selectively rejecting contaminated data have been 
investigated for adaptive processing. 

Based on systolic triarrays initiated by Gentleman/Kung and McWhirter, 
many other forms of systolic arrays are shown to be capable of implementing 
different algorithms. Various updating and downdating systolic algorithms and 
architectures for RLS filtering are examined and compared in details, which in- 
clude Householder reflector, Gram-Schmidt procedure, and Givens rotation. A 
unified approach encompassing existing square-root-free algorithms is also pro- 
posed. 

For the sinusoidal spectrum estimation problem, a judicious method of sepa- 
rating the noise from the signal is of great interest. Various truncated QR meth- 
ods are proposed for this purpose and compared to the truncated SVD method. 
Computer simulations provided for detailed comparisons show the effectiveness 
of these methods. 

This thesis deals with fundamental issues of numerical stability, computational 
efficiency, adaptivity, and VLSI implementation for the RLS filtering problems. 
In all, various new and modified algorithms and architectures are proposed and 
analyzed; the significance of any of the new method depends crucially on specific 
application. 
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Chapter 1 


Introduction 


For many real-time signal processing applications, such as system identification, 
channel equalization, adaptive antenna arrays, spectrum estimation, etc., fast and 
numerically stable algorithms for solving least-squares problems are necessary and 
important. In particular, under non-stationary conditions, these algorithms must 
be able to adapt themselves to reflect the changes in the system and take appro- 
priate adjustments to achieve the best performance. Among existing algorithms, 
the QR-decomposition (QRD)-based recursive l east -squares (RLS) methods have 
been proven to be very useful and effective towards adaptive signal processing in 
modern communications, radar, and sonar systems [7, 9, 15, 16, 22, 23, 25. 27, 
28, 31, 30, 36, 37, 3S, 39, 40, 41, 42, 43, 4S, 47. 51, 54, 53. 55, 56, 57, 61, 65, 64]. 

A time- recursive formulation of RLS filtering employing block QRD will be 
given first. To facilitate nonstationary adaptive processing, exponentially grovv- 
ing, fixed-size sliding, and a new non-continuous windowing scheme based on 
selectively rejecting contaminated equations are introduced and compared. 

Various updating and downdating algorithms for RLS filtering are examined 


1 



and compared, which include Householder reflector, Gram-Schmidt procedure, 
and Givens rotation* When the block size of data is reduced to one, these al- 
gorithms reduce to existing scalar-type processing methods. A unified approach 
encompassing existing square-root-free algorithms is proposed. 

Systolic arrays promise to be some of the most suitable VLSI architectures 
for implementing real-time adaptive signal processing algorithms. Based on sys- 
tolic triarrays initiated by Gentleman/ Kung [ 19 ] and McWhirter [ 41 ], many other 
forms of systolic arrays are shown to be capable of implementing different algo- 
rithms. 

For some applications, like sinusoidal spectrum estimation, a judicious method 
of separating the noise from the signal is always of great interest. This involves 
rank determination and truncation in the linear system modeling of the problem. 
Some QR-based methods are proposed for this purpose and compared to other 
truncation methods. Computer simulations are provided for detailed compar- 
isons. 

In this thesis, numerical stability, computational efficiency, suitability for 
VLSI implementation, and adaptivity are the major considerations for the RLS 
filtering problems. In all, different methods are provided and analyzed whereupon 
the choice rests on the application itself. 

This thesis will focus on efficient and suitable VLSI algorithms for RLS filter- 
ing with applications to real-time signal processing problems. Attention will be 
focused on recursive vectorized block data processing VLSI structure, which gener- 
alizes current scalar-type processing structures and possibly reduces the I/O cost. 
For many architectures, computational speed is much faster compared to data 
movement; this is the reason why many algorithms are being reformulated into 
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Table 1.1: Summary of notations. 


Symbol 

Notation 

n 

time index 

p 

order of LS filtering (no. of columns/sensors) 

k 

block size 

Xi 

i-th snapshot of matrix data block 

y» 

i-th snapshot of desired (R.H.S.) data block 

Q{n) 

orthogonal range space of X(n) 

Q^in) 

orthogonal null space of X '(?i) 

w(n) 

opt. weight vector e W at time n 

e(n) 

opt. overall residual vectors, .V(?i)w(n) — y(n) G 'JZ nk , till time n 


opt. residual vector, .Y n w(?r) - y„ € TZ k , at time n 

R(n) 

triangular matrix € 7Z pxp obtained from QRD of X(n) 

u(n) 

projection of y(n) onto Q{n) 

v (n) 

residual vector by projecting y(n) onto Q L {n) 

l 

fixed- window size (no. of blocks) 

A 

block forgetting factor 

A 

block row-weighting matrix 

Z 

previously triangularized matrix with appended new data block 

C, 3 

cosine and sine values of the planar rotation angle 

C, S 

hyper-cosine and sine values of the hyperbolic rotation angle 

k\ , /; 2 

normalization factors of row data for sqrt-free algorithms 


parameters of different sqrt-free algorithms 


vectorized algorithms. Block processing also offers the advantages of reducing 
arithmetical operations and possibly minimizing the round-off errors occurring in 
the intermediate stage because of the enlarged wordlength in the internal regis- 
ters. Table 1.1 provides a list of major symbols and notations used in this thesis. 

Finally, future work and a summary of previous and new results will be given. 
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Chapter 2 

Mathematical Formulation of 
Time Recursive Least-Squares 
Filtering Problems 


For many signal processing applications, time recursive updating forms an im- 
portant and natural approach to tackle such problems. Consider the adaptive 
antenna array as shown in Fig. 2.1 [25, 16]. At each time snapshot, a k: x /> 
dimensional data X is collected by the auxiliary antennas while the main an- 
tenna receives the data y. Our goal is to find a set of weighting coefficients 
w h 3 ~ b‘**<>P* and possibly its associated residual, such that the Euclidean 
norm of the overall residual up to time ??., \\Xi • w — y : *|| 2 is minimized. If 

^ = b Gentleman and Kung [19] in 19S1 proposed a systolic array to update the 
optimum weight w(n) as time n advances, by successively updating the upper- 
triangular matrix of the QR decomposition of the augmented matrix of X(n) 
and y(n). McWhirter [41] in 1983 extended this structure by computing the 
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most recent scalar residual at each time snapshot without explicitly computing 
the optimum weight vector w(n) which entails back substitution. 


2*1 Time- Recursive Block QR Algorithms for 
LS Problems 


In this section, recurrence formula to update the optimum weighting and residual 
vectors in a block manner as a function of time are derived. Consider a time- 
recursive least-squares (LS) problem: 


A'(n)w(n) « y(n) , 


( 2 . 1 ) 


where AT(n) and y (n) have growing dimensions in the number of data blocks in 
rows (growing- window), 


X(n) = 


1 

* 




yi 



€ r R nkxp , 

y(«) = 

; 

—J 




i 

<< 

3 


€Tl nk , ( 2 . 2 ) 


with A, € 'R. kxp , i — 1, 2, • • • ,n and y ; £ TZ. k , i = 1,2, Here we denote 


k and n as the block size and the time index respectively, and p is the order of 
the LS problem. Capitalized letters (e.g., A) are used to denote matrices; small 
letters in boldface (e.g., y) vectors, and small letters (e.g., u> t ) scalars. A time 
index n is represented in the parenthesis, e.g., A(n), to denote all of the time 
span until n, or in the subscript, e.g.. A,-, the time epoch n only. For simplicity 
of notations, we also choose our data as real- valued. It is very easy to extend 
to the complex- valued cases. The previously considered scalared cases in [19. 41] 
then have k = 1. 
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The LS solution w(n) € W> is computed such that the Euclidean norm of the 
residual vector 

r 

®i(«) 


e(n) = 


e 2 (n) 


= X(n)w(n) - y(n) € 1Z 


nk 


(2.3) 


e„(n) 

is minimized. All the norms || • || of vectors or matrices mentioned are 2-norm, 
unless otherwise specified. Our interest is to find the recurrence formula for w(n) 
and e(n) as n increases. 

Suppose the QR decomposition of the augmented matrix [ A r (n) y(n) ] is 
known at time n, 


[ X(n) y(n) ] = [ Q( n ) Q x (n) } 


(2.4) 


R(n) u(n) 

0 v(n) 

where Q{n) 6 7Z nkxp and Q x (n) £ 7 ^ nfcx (nfc-p) re p resen t the orthogonal range and 
null spaces of the data matrix -Y(n), and u(n) € W > is the projection of y(?r) onto 
Q(n), v(7i) £ 7l nk ~P is its counterpart projected onto <5 x (n), and R(n) £ 7 Z pxp 
is an upper-triangular matrix and assumed to be full-rank. R(n) is sometimes 
called the Cholesky factor of the covariance matrix of A”(n) in that the Choleskv 
factorization of A’ r (n)A’(n) can be uniquely expressed as R T (n)R(n) subject to 
the signs in each rows of R(n) as long as X(n) has full column rank. 

Because an orthogonal transformation preserves the Euclidean norms of of a 
vector, it can be shown that[35] 

ll e ( n )ll = ||AT(n)w(n) — y(n)|| (2.5) 


[ Q{n) Q X (n) 


R(n) u{n) 

0 v(n) 


w(n) 

-1 


(2.6) 



(2.7) 


! Q( n ) Q l (n) ] 


R(n)w(n) — u(n) 

-v(n) 

||(J(n)(H(n)w(n) - u(n)]|| + || - <J 1 (n)v(n)|| 
|| - CJ»v(n)|| 


( 2 . 8 ) 

(2.9) 


as long as 

R(n)v/(n) = u(n) . (2.10) 

(2.8) follows from the fact that the Euclidean norm of the sum of two or more 
orthogonal vectors is equal to the sum of Euclidean norms of these vectors. (2.9) 
means that the residual vector while estimating y(n) from X(n) must lie in the 
null space of AT(rc) which corresponds well with the geometrical interpretation of 
the orthogonal principle of LS problems. 

As the time index n advances by one, i.e., a new data block of size k , 
[ X n+ i y n +i ], is acquired, we can write the recurrence formula for QRD as 
follows: 


[ X(n + 1) y(n + 1) ] = 


A"(n) y(n) 

An+l y>i+l 



_ 

Q X (n) 



' R(n) 

u(n) 


Q(n ) 

0 



0 

0 

7 * . 


0 

v(n) 






n+ 1 

Yn+l _ 


_ 


_ 


Qn-hl 

Q 


Q(n) 

Q L {n) 

0 








I-nk—p 


0 

0 

Ik 








Qn+l 

Q 


n+1 


n+1 


( 2 . 11 ) 


( 2 . 12 ) 


R(n + 1) u ( + 1 ) 

0 v(n) 

0 v n+1 

(2.13) 


S 



(2.14) 


Q(n)Qn + 1 

Qn+l 


QHn) 

0 


Q(n)Qn + 1 


Qn. 


n+1 


R{n -f 1) u(n + 1) 


= [ Q(n + 1) g i (n+l) 

By defining 


Q{n + 1) = 


R(n + 1 ) 

0 


Qt 


0 
0 

u (n + 1) 
v(n + 1) 


v(n) 

v n +i 


(2.15) 


€ 'R} p + k } x ( p+k ) 


(2.16) 


Qn+l V„+l 

Qn+l Q n +l J 

we note that Q(n + 1) constitutes an orthogonal transformation to annihilate the 
newly appended data block A' n+ i, and Q n+l € H pxp and Q n+l G 7Z kxp represent 
the operation of modifying the range space while Q„ +l G 7Z pxk , and <5n+i € R-kxk 
that of the null space. We use a hat ~ to denote the new dimensional growth due 
to the appended data. To sum up, we have the following recurrence formula: 

Q(n)Qn + 1 

Qn+l 

QHn) Q(n)Qt +l 

0 QUi 

v(n) 


Q(n +1) = 

Q ± (n + 1 ) = 


€ 7^ n+1 ^* p 


(2.17) 


6 x ("+Ufc-P |2 IS) 


v(n + 1) = 


Vn+l 


€ n { 


n+ 1 )k — p 


(2.19) 


1 

J5 

3 

+ 

u(?i + 1) 


R(n) 

u(n) 

0 


= Q(?z + 1) 

V n +1 


l 

+ 

Yn+l 


Qn+l Qn+l 


R(n) u(n) 

Qn+l Qn+l 


-^n+1 Yn+1 


.( 2 . 20 ) 
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The desired optimum weighting vector w(n+l) and the residual vector e(n+l) 
are thus given by 

R(n + l)w(n + 1) = u(n + 1), (2.21) 

which can be solved by back substitution, and 


e(n + 1) 


-Q L (n + l)v n+l (see (2.9)) 


Q L {n) Q(n)Qn+i 


v(n) 

1 

-it 

<o> 

0 

1 


. v "+! . 


-Q x (n)v(n) - (?(n)g x +1 v n+1 

— *5n+l V n+1 


e(n) - Q(n)Q% + ,v B+1 

-<2n+l V "+l 


e n (n+l)k 


( 2 . 22 ) 

(2.23) 


(2-24) 


(2.25) 


To see the changes of residuals in each previous data blocks due to a new 
observation of .Y n+1 and y n+1 , we can write down the following lemma. 


Lemma 1 (updating residual) 


ej(n + 1) 


ei(n) - Q1Q2 ■ 

* QnQn+l V n+1 

&2{ n + 1) 


e 2( Jl ) _ Q2Q3 1 

* QnQ n+1 ^n +1 

e n( n + 1) 


e«(n) - QnQh+^n +1 

e n+l(tt + 1) 


— Qn +1 V n+1 


(2.26) 

Proof. (2.26) can be derived from (2.17) and by noting that Q(l) = Q lt i.e., 


Q( 2 ) 


Q1Q2 
Q 2 
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Q( 3 ) 


’ Q( 2 )Q 3 ' 


Q1Q2Q3 

<?3 


Q2Q3 

Q3 


Q1Q2 * * • Qn 
Q2Q3 * * * Qn 


L Q 

and substituting Q(n) back into (2.25). 


(2.27) 


(2.26) explains that the overall residual vector at time n + 1 comprises of two 
parts, one of them is equal to ~<5n+i v n+n the new dimensional growth due to 
^n+i> while the other one is equal to the old residual vector at the previous time 
Ti y e(n), offset by Q{n)Q^ x v n+1 . Therefore, if we are only interested in R(n + 1) 
and/or e n+1 , then we can simply maintain the information of R(n) and u(n), 
which is usually the case for many applications such as beamforming[41, 61]. 
However, if we need to monitor all of those previously block residual vectors 
i — 1, ■ • • , n, then the previously computed range space Q(n) is still required to 
update those old residual vectors. This monitoring may aid in the determination 
of some spurious observations(rows) such that they can be deleted (downdated) 
from the LS estimation problem and mitigate the possible bias caused bv them. 
For linear regression [18, 35], this diagnosis in monitoring all the residuals is 
especially very important. Our method, following the approach first proposed 
by McYVhirter [41], provides a one-pass direct way of keeping track of all of 
the residuals, without explicitly computing w(n) followed by X(n)w(n) — y(n) 
which requires two-passes (involving the use of back substitution twice) and can 
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be objectionable from the throughput point of view. We will elaborate on this 
later in Section 2.6. 


2«2 Pseudo Cholesky Decomposition 


In this section, pseudo orthogonalization will be defined and will be required in 
explaining the downdating operations. Let Z be an (m + n) x p matrix cascaded 
from two real-valued data matrices A € ‘JZ mxp and B € TZ nxp , i ,e„ 


^ Jm/n-pseudo sample covariance matrix of Z is defined as Z T J m / n Z, and its 
corresponding J m / n -pseudo Cholesky decomposition is defined as 

R t R = Z T J m/n Z = A T A - B t B, (2.29) 


where 


J, 


m/n 


-I 


n 


£ '^( m + n)x(m+ n ) 


(2.30) 


is called a pseudo identity or signature matrix, and the p x p upper-triangular 
matrix R (assuming that R exists and has full rank) is called the J m/n -pseudo 
Cholesky factor of the pseudo-covariance matrix of Z . For convenience, from now 
on we will suppress the subscript ('), n / n in J unless it is necessary. 


2.2.1 Notations 

A J -pseudo inner product is defined as 

< u, v >j= u T Jv =< v,u >j, Vu, v£F +n , (2.31) 
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and a J -pseudo vector norm is defined as 


||u||j = Vu T Ju, V u 6 7l m+n such that u T Ju > 0. (2.32) 

An (m + n) x (m + n ) matrix H is said to be J -pseudo orthogonal if H T JH = J, 
namely, 

hf ' 

h 2 r 

A hi h 2 • • • h m+n ] = (2.33) 

h r 

AA m+n 

l|hi || j < hi, h 2 >j ■ ■ ■ < ^1 1 llrn+n > J 

<hi,h 2 >j \\h 2 \\j ••• < h 2 , h m+n >j 

hi, h m + n j <c h 2 , h m + n > j * * * ||h m + n || 'j 

Equivalently, we can say that H has J-pseudo orthogonal columns. This means 
that for any two columns of their J-inner product satisfies 

1 , if 1 < i = j < m. 

— 1 , if m < i = j < m + n. (2.34) 

0 , otherwise. 

2.3 Simultaneously Up/Down-dating RLS Prob- 
lems 

Modifications of matrix factorizations have been of great interest in many appli- 
cations [6, 45, 35], In particular, recursively up/down-dating QR decomposition 
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by adding some new and deleting some old data rows will be examined, which 
will then lead to the systolic implementations in the following chapters. 

A time-recursive up/down-dating RLS problem amounts to find [ R(n + 
l)?u(n+l) ], v(n+l) and/or e( 7i+l) from the knowledge of [ i?(n),u(n) ],v(n),e(n), 
the new data block [ A'„ +1 ,y n+1 ], and the old data block [ A' n _ <+1 , y n _, +1 ], i. e „ 


R(n) u(n) 

0 v(n) 



[ Yn+l 

X n -t+i 

[ y n — 1 


or symbolically 

x x x ••• x x 

x x • •• x x 
x • • • x x 

x x 

+ + + ••• + + 



R(n + 1) 

u(n + 1) 

U pj Downdating 

0 

v(n) 


0 

v„+, 


0 



1 

0 

0 

0 

• • • © © 


0 © 

© © 


© 

• •• © © 

Dpf Downdating 


© © 


0 0 0 

o 

© • 


0 0 0 

0 ft 


Following (2.4), then we have 


(2.35) 


(2.36) 


1 

}< 

_ J 


Q(n) Q x (n) 


R(n) y(n) 

^n+ 1 Yn+l 


h 


0 v(n) 

V * \r 




A n+ 1 y 1 

A n-(+l Yn -^+1 


h 







_ Xn-e+i Yn-e+i 
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Q( n ) Q x (n) 


H\\ H 12 Hi3 


R(n + 1) u(n + 1) 



I 


0 v(n) 

h 







Hu H 22 H 73 


0 V„+1 

h 







H 31 i /32 


0 


(2.33) 

and the extended up/down-dated I'esidual vector is given by 


e(n + 1) 


AT(n) y(n) 

■A n+ 1 Yn+ 1 
1 y n — i 


w(n + 1) 
-1 


—Q' L (n)v(n) - Q{n)H l2 V n+1 - Q(n)// 13 v n _ f+1 


— H 22 Vn+l ~ HttlLn-t+l 

— H 22 V n +\ — HzzV.n-t+1 


(2.39) 


(2.40) 


2.4 Growing-Window with Exponential Forget- 
ting Factors 

If we replace the augmented LS equations in (2.4) by premultiplying a diagonal 
weighting matrix A(n) = diag(\ n ~ l , A If., It.) to diminish the importance of 

those previous observations (rows), where A € (0,1] is a block forgetting factor, 
then the exponentially weighted residual in (2.25) will now become 


A(n + l)e(n + 1) 


Ae(u) - Q(n)Q^ +1 v n+l 

~Qt+i v n+i 


(2.41) 


where we can see that the previous residuals are gradually deemphasized by A. 
Equivalently, we may consider the weight vector w(n) is chosen such that the 
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A-weighted Euclidean norm of residual vector, 

ll e ( n )IU = ||A»-‘e t -|| 2 ( 2 . 42 ) 

is minimized. 

2.5 Sliding-Window with Fixed-window Size 

A fixed-window or sliding-window RLS filtering needs to incorporate the new 
data segments (updating) and also remove the influence of the obsolete data 
(downdating). If we denote i as the number of blocks of the fixed- window size, 
then for n > the fixed- windowed data can be written from the growing- window 
data given in (2.2) by discarding the oldest n - C blocks of data, i.e., 

A^-r+i Yn-r+i 

A»= : y (n)= ; e n (k . (2.43) 

. J y n 

In order to obtain R{n + 1) from R[n), we need to update (include) A r n +i and 
downdate (remove) A' n _ t+ i from R{n), i.e, 

R(n + l) T i?(n + 1) = R(n) T R(n) + A' n r +1 A' n+1 - X^_ e+ 1 X n . e+l , (2.44) 

where we have implicitly noticed that R{n + 1 ) T R(n + 1) = A"(n + l) T A / '(n + 
1), R(n) T R{n) = X{n) T X{n), and A'(n+l) T A'(n+l) = X(n) r X(n)+Xj +1 A r n+ i- 
A n -r-n An-<+i • Therefore, an updating operation in the direct data domain is 
equivalent to an addition in the second order domain (covariance data), while 
a downdating operation is equivalent to a subtraction. There are two ways to 
accomplish this; one is to perform updating and downdating at the same time. 
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or we can do them one by one consecutively. These will be discussed in details 
in the following chapters. 

Under time-varying conditions, much attention has been focused on schemes 
employing exponential forgetting factors, while less on fixed-windowed ones. This 
is partially due to the difficulty of downdating obsolete data encountered in the 
windowed RLS model. But, fixed-window scheme should not be precluded sim- 
ply because its computational burden. Other factors, especially fast parameters 
tracking ability, actually favors this method under some non-stationary condi- 
tions. To motivate the need for fixed-window under non-stationary condition, a 
computer experiment is given to demonstrate the advantage of the faster con- 
vergence for the fixed-window method over the method based on an exponential 
forgetting factor. 

2.5.1 An AR model computer simulation 

Consider a second order autoregressive (AR) process {u(i)} as given in [25, pp. 
204-6], where u(i)+aiu(f— l)-f-a 2 u(f— 2) = t?(i),i = 1, ..., 100, with aj = —0.9750 

and a 2 = 0.9500 and u(i) + a[u{i - 1) -f a 2 u(i - 2) = r(i),i = 101 250, with 

a i = ~ 1-5955. u(-) is a white Gaussian noise with a standard deviation of 0.1, 
except that from i = 20 to 30, v(i)'s are intentionally increased by a factor of 20 
to account for temporary noisy perturbations. This is equivalent to lowering the 
SNR by 26 dB during that interval. We note that there is also a step change at 
iteration 100 in the parameters of the AR model where a r = -0.9750 jumps to 
a[ = -1.5955. 

To make a fair comparison between the fixed-window scheme with a window 
size i to the exponentially weighting scheme with a forgetting factor A in the sense 
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that both schemes have the same self noise (i.e., fluctuation of the estimated pa- 
rameters with respect to the optimum AR parameters) [40], we choose t = 50 and 
^ = \A^ — l)/(^ + !)• 100 simulations with different noise realizations were per- 
formed. Fig.2.2 depicts the mean bias versus number of iterations in estimating 
the AR parameters <zj and a[. For the fixed-window scheme, the transient effect 
of the changes in the noise (from iteration 20 to 30) is almost completely ab- 
sent by iteration 60, while its effect on the exponentially weighted window is still 
present. Even more significant is that for the fixed-window case, after the sudden 
change in the parameter to a\ at iteration 100. convergence is reached with 
only 50 more iterations, while for the exponentially weighted window case, about 
150 iterations are required. This simple example shows that a fixed-window 
scheme is indeed more suitable for fast parameter tracking under various non- 
stationary conditions. Since the computational load of a fixed-window scheme 
is greater(with up and down datings) than that of the exponentially weighted 
scheme(with only updating), we need to be particularly concerned with its algo- 
tithmic and architectural efficiencies. This comparison shows that a fixed-window 
scheme is indeed necessary and important in speedy tracking parameters under 
some non-stationary conditions. 

Until recently, efficient downdating algorithms have been proposed [2, 48]. 
But efficient implementations and architectures of fixed-windowed RLS filtering 
are still rarely considered. In the following chapters, systolic triarrays which are 
suitable for VLSI design, are proposed to perform fixed-windowed RLS estima- 
tion. 
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Figure 2.2: Mean bias of estimating non-stationary AR parameter for fixed and 
exponentially weighted windows. 

2.6 Residual-based Selective Window 

A new algorithm performing recursive least-squares (RLS) filtering is proposed. 
Unlike a sliding fixed-window scheme, this new windowing scheme can be non- 
continuous, depending on the estimated level of observation errors( residual). By 
monitoring the residuals in a recursive manner (see (2.26) in Lemma 1), we can 
effectively remove those spurious observed data by downdating them. This algo- 
rithm is most useful when some short-time large interferences perturb the system 
occasionally. In this respect, it outperforms existing schemes, either exponentially 
growing or sliding. A computer simulation will be given to justifv this. 
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2.6.1 Robust LS estimations with outlier removal 

A conventional LS solution w(n) € W for (2.1) is computed such that the Eu- 
clidean norm of the residual vector is minimized. Notice that e,(n) £ 7Z k , 1 < 
i < n denotes the residual of the i-th data block, AT,w(n) — y,-, when we try to 
make a best fit from all of X„ 1 < i < n, to y,-, 1 < i < n. After obtaining 
the optimum weight coefficient w(n) and also the corresponding residual vectors 
e i( n ) 5 • • • i e n( n )> we can determine an index set 3 of outliers based on the crite- 
rion, of J = {j | 1 < j < n, 1 1 e j, ( r? ) 1 1 > threshold } such that those extraneous 
outliers can be removed. This is the first pass which comprises of: (1). deter- 
mining the solution of w(n) which may be obtained from either normal equation. 
Gaussian elimination or QR decomposition followed by back substitution; (2). 
finding the associated optimum residual which is obtained by plugging w(n) into 

e(n) = A’(n)w(n) — y(n). Next, remove all X, and y J? V; € J , from (2.1). Then 
we have 

A r y(n)w % y j(n) . (2.45) 

Bv using the solution of (2.45), this completes the second pass of a robust LS 
problems. 

Figure 2.3 depicts the diagram of the 2-pass robust LS problem with outlier 
removal[4S]. For the two- pass case, a QRD is performed on [A r y] followed by back 
substitution to find the tentative w(n). After the residual vector is computed, a 
decision is made to determine which data rows are to be downdated. When all 
of the outliers are downdated from the Cholesky factor R and the associated u. 
back substitution is required for the second time to solve for the final optimum 
w(n) and the associated optimum residual can be computed. 

For the one-pass case, the QRD operator needs not only to orthogonalize the 
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incoming augmented data matrix, but also to monitor all of the residual vectors, 
therefore, the computational load is higher in this case. However, by doing so, we 
have eliminate the need to compute w(n) by back substitution and the matrix- 
vector computation of *(n)w(n) - y(n). After all of the residual vectors are 
available, those outliers can be downdated and finally a back substitution is used 
only one time to compute the optimum w(n). 

2.6.2 Comparisons of performances of different windows 

A second AR model (for details, see the previous simulation case in Section 2.5.1) 
is used again to demonstrate the advantages of the new window scheme. This 
time u(i)’s are intentionally increased in amplitude by a factor of 30 from i = 55 
to 57 and also from i = 155 to 157 to account for temporary large noisy spikes. 
This is equivalent to lowering the SNR by about 30 dB during these intervals. 

Figs. -.4 and 2.5 compare the biases of estimating the AR parameters aj 
and a 2 . Figs. 2.6 and 2.7 compare the standard deviations of estimating the AR 
parameters a x and a 2 . Fig. 2.S compares the standard deviations of the residuals. 
Four windowing schemes are compared: 

1. no windows are imposed (or equivalently, forgetting factor A = 1), 

2. exponentially weighted window with forgetting factor A = ^49 /51. 

3. fixed-size sliding window with window size £ = 50, 

4. selective window with residual threshold = 1.0. 

From these figures, vve can see that the newly proposed window by selectively 
rejecting data rows with large residuals gives the least bias in tracking the AR 
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[X 


Figure 2.3: Fuo-passes and one-pass diagrams of robust LS estimations with 

outlier removal 
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Figure 2.4: Comparisons of mean bias of estimating AR parameter a, for various 
windows under noisy spikes. 

parameters and converges most rapidly. This is because this method has dis- 
carded those highly perturbed data. 


2.7 Other Issues 

2.7.1 Mixed-type window schemes 

The fixed-window of size i with an exponential block forgetting factor A has a 
row-weighting matrix A, = (A* -1 /*., •••. A/*, /,). We want to modify A R(n) by 
updating A n+1 and downdating A*.Y n _ f+1 , which implies that R T (n + l)i?(n + l) = 
\ 2 R T (n)R(n) + X^ +l X n+l - * 2t Xl m X n _ (+l . 
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Figure 2.5: Comparisons of mean bias of estimating AR parameter a 2 for various 
windows under noisy spikes. 



Figure 2.6: Comparisons of standard deviations of estimating AR parameter a { 
for various windows under noisy spikes. 
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Figure 2.7: Comparisons of standard deviations of estimating AR parameter a 2 
for various windows under noisy spikes. 



Figure 2.8: Comparisons of standard deviations of residuals of estimating AR 
parameters for various windows under noisy spikes. 
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2.7.2 MVDR: least-squares with linear constraints 

The above derived algorithms only focus on recursive LS problems with no con- 
straints. If a set of linear constraints are incorporated, after slight modifications [3S], 
they are still applicable. MVDR beamforming [9, 61, 42, 64] is one of the exam- 
ples of such applications. 


2.8 Conclusions 

Up to now, we have assumed that QRD methods are readily available while per- 
forming updating/downdating. Explicit algorithms and systolic implementations 
will be given in the following chapters, which include Householder transforma- 
tion, (modified) Gram-Schmidt orthogonalization procedure, and Givens rotation 
methods. 
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Chapter 3 

Block Gram-Schmidt 
Pseudo-orthogonalization 


3.1 Pseudo Orthogonalization Algorithms 

We denote a J -pseudo orthogonal decomposition of Z as 


Z = \ II II J 


where H € and G / 2 (m+n)x(m+n-p) constitute a J-pseudo orthogo- 

nalization matrix, and R is a p x p upper triangular matrix. Notice that R is 
indeed the J-pseudo Cholesky lactor of the J-pseudo sample covariance matrix 


of Z in that 


Z T JZ = [ R t 0 }H t JH = r t r. 

0 
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Rewriting (3.1) as 


[ Zj z 2 • . . z p ] = [ lu h 2 • • • h p ] 

leads to 

z i = r 1 1 1 

z « = r i«hi + ^ 2 ill 2 H 1- r,ih, = 23 rjihj (3.4) 

2=1 

P 

z p = 13 r it h;. 

Multiplying zf J on the left-hand-side and r n hf J on the right-hand-side to the 
first equation of (3.4), and also noticing that ||h,||3 = 1. we have 

r n = ||z,||j (3.5) 

and 

hi = z i /i'n - (3.6) 

Next, pseudo correlating or taking the pseudo inner product of the second 
equation in (3.4) with hi, i.e., premultiplying it by hfj, we have 

rn =< hi,z a >j . (3.7) 

^"22 can be obtained by taking the pseudo norm of Z 2 — rjjhi — r 22 li 2 which giv r es 

r 22 = ||z 2 - r 12 hi||j, (3.S) 


r ll f\2 - 

r 22 ■ 


•* r i P 

•* r 2p 


(3.3) 


PP 
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and h 2 is also readily known as 

h 2 = (z 2 - r 12 h 1 )/r 22 . ( 3 . 9 ) 

By continuing this procedure, at the i - th step, all of the i nonzero elements 
in the 1 - th subcolumn of R can be computed by pseudo correlating the 1 - th 
equation in (3.4) with hi, h 2 , . . . , h,_! respectively, i.e., 

r u = < hi, z, >j, 

r H = <hj,z,>j, (3.10) 

= < hi-i,z,' >j, 

and r,, and h, are given as follows: 

r i{ = || z, - r x ,h, r i _ lji h t -_ 1 ||j ( 3 . 11 ) 

and 

h, = (z , - rj.h, - • • • - rj-j .ht-iJ/ri,. ( 3 . 12 ) 

Therefore, a Gram-Schmidt pseudo orthogonalization (GSPO) procedure is 
derived as follows: 

Algorithm 1 (Gram-Schmidt Pseudo Orthogonalization) 
for j = 1 , •••,/>, do 

t = z, ; 

for i = 1 ,...,; - 1 , do 

r U =< Zj >j ; 
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t = t - r tJ h, ; 
end; 

r Jj = l|t||j ; 

= t / r jj ; 

end. 

To derive the modified Gram-Schmidt pseudo orthogonalization (MGSPO), 
we first take the pseudo inner products of every equations in (3.4) with z, and 
also notice that < h,, hj >7 = Sjj, so we have 

Il z i|l5 = r n and h 1 = Zl /r n , 

<Zi,z i>j = r n r u , (3.13) 

<Zi,z p >j = r n r lp . 

Thus, the first row of R, r u , r 12 , . . . , /’i p , and h l5 the first column of H. can be 
computed. Next, subtract ri,hi from z ,, i = 2, • • • ,p, 

Z2 = z 2 - r 12 hi = r 22 h 2 , 

i 

= 2. - rx.-hi = Y1 r 2 .h j , (3.14) 

3 = 2 

P 

z p = z p - /'iphx = ^2 rjihj . 

3=2 
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Similarly, the second row ot R. r 22 , .... r 2p , and the second column of H. h 2 , 
are ready to compute as follows: 

11^2113 = r 22 and h 2 = z 2 /r 22 , (3.15) 

<z 2 ,z i>j = r 22 r 2 ,' , (3.16) 


< Z 2 ,Z p >j — r 22 r 2p * 

Continuing this procedure. R and H can be fully obtained. This process is 
essentially a modified Gram-Schmidt version of pseudo orthogonalization method 
in that the columns of Z are successively subtracted by those pseudo projected 
vectors determined by pseudo inner products and R is computed row by row 
while H is determined column by column. A modified Gram-Schmidt pseudo 
orthogonalization algorithm is thus given as follows: 

Algorithm 2 ( MGSPO (I) ) 

for i = 1, • • • , p, do 

r « = INI; ; 

hi = z i/i'u ; 

for j = i+ 1, •••,/?, do 

r ij =< hi,Zj >j ; 

Zj Zj ; 

end; 

end. 
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This algorithm requires (m + n)p 2 multiply-and-add, p(p- l)/2 divisions and 
p square roots operations, namely, 0((m + n)p 2 ) flops. It is noted that another 
MGSPO algorithm where R is computed column by column [26], can also be 
derived in a similar way, and is given as follows: 

Algorithm 3 (MGSPO (II)) 

for j = do 

for * = I, •••,;- 1, do 

r ij =< h,,Zj >j ; 

z j = 2 i - ryhj ; 

end; 

r a = ll z il U ; 

h j = 2 :/ r n ; 

end. 


Based on the procedures above, we have the following theorem. 


Theorem 2 (Pseudo Orthogonal Decomposition) For any Z € 7£ (m+n)Xp 

r /_ 1 ' 


J ~~ i an J -pseudo orthogonal decomposition of Z, Z = HR, exists 

and is unique subject to the signs oj each row m the upper-triangular matrix R 
and the signs in the columns of the pseudo orthogonal matrix H , if and only if 
Z T JZ is positive definite. 
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3.2 Square-Root-Free Triangularization Algo- 
rithms 

For many computations, it is advantageous to reduce the number of square root 
operations or even eliminate them altogether. To this end, we can consider the 
inverse of the diagonal elements of R and decompose R into 

R = D\ /2 R d (3.17) 



1/fn 


r ii r n r i2 • • • r u^i P 

= 

l/r 2 2 


r 22 ’ * " r 22 r 2p 


1/Cpp ^ 


r 2 

L pp J 


It is noted that the operation of D J is only stated here for symbolic purpose; our 
interest is essentially to find R d (D r can be obtained from the diagonal elements 
of R d ), or, rfj , for 1 < i < p; i < j < p. A square-root-free MGSPO procedure 
to obtain the upper triangular matrix R, or equivalently R D , is given below. 

Algorithm 4 (Sqrt-Free MGSPO) 

for i = 1, . . . , p, do 

for j = i p , do 

Wn =< z„ Zj >j ; 

z : = z ; - ( r a r ij/r 2 i )zi ; 

end; 

end. 
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3.3 Downdating the Cholesky Factor 


In adaptive signal processing, it is often necessary to keep the Cholesky fac- 
tor only, and then successively update/downdate this upper-triangular matrix as 
nevv/old data rows become available. Updating via orthogonal transformations 
such as Householder transformation or Givens rotation method are well known. 
Here we only consider using the Gram-Schmidt method to perform pseudo or- 
thogonalization of 2.2S) which now becomes 


Z = 


R 

D 


r n r \2 ■■■ r Xp 



£ 7v^ p+ "^* p 


PP 


d, 


d 2 


d P 


(3.1S) 


with R G 7Z pxp being upper-triangular and D G TV' Xp the appended data rows 
to be discarded. We are interested in R G 7Z pxp such that R T R = Z T JZ = 
RTR ~ D t D. The signature matrix becomes J p/n . If we denote Z (0) = Z, then 
the sqrt-free MGSPO can be rewritten as follows: 


for i=l, • • • , p, do 


= M' u llj ; 


for ; = ?:+ 1 , • • • , p, do 


r n p ij = < Z 


(-D w (i-l) 


>J 


7 (') _ fw r,, ( * — 1 ) 

J ~ Z 3 Z . 


end; 


end. 
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Notice that 


Z (,) = ' 


= r „(•') 


€ ft(p+»)x(p-.) ? i=l .,, p _ l 


(3.19) 


In this scheme, even though we can take advantages of the zeros already in R 
to reduce the computational cost by half, the number of flops is still of the order 
°f P to third power (where p is the number of the columns of R ) , and is given 
hy 

5Z(( n + *)(p - * + 1) 4* (n + i)(p — «)] = I[p 3 + 3 np 2 + — + -]. (3.20) 

This arises Irom the computational load in obtaining the lower right R. To see 
this, it is noticed that the work to compute r tJ ? j = grows linearly with 

the index i because pseudo inner products of size n + i are required in computing 
the z-th row of R. This load imbalance among row computations while down- 
dating(or updating) a Cholesky factor makes the MGS methods less favorable 
especially under massive-data(very large p) parallel computing conditions. An- 
other drawback of this unmodified MGSPO(same for GSPO) is the difficulty in 
implementing an efficient VLSI architecture to accomplish downdating, although 
the sqrt-free computation is very attractive. To circumvent these difficulties, the 
previous algorithms must be reformulated to reduce the order of computation 
and hopefully also to facilitate VLSI processing. 

Next, we will reexamine the operations involved in computing r s and succes- 
sive modification of the appended data block D. Then an improved algorithm 
can be derived. At each step, one row of the updated Cholesky factor will be 
obtained. The key idea is to represent the modified data in the old Cholesky 
factor in terms of the modified appended data. To clarify this idea, we will derive 
this algorithm in the following discussion step by step. 
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Step 1. 


Hi r l2 
r 22 


r lp 

r 2p 


z (0) = 


d<°> 4°> 


' VP 

• dW» 


■U {hi, • • • , h P } 


(3.21) 


r (1) 

Jl) 

r 12 ’ 

r 2 P 

r 22 

’ • ? ’2p 


= Z (1) 



T PP 


.. d (D 


{ hi , • • • , h P } can be computed, while { r$ . • • ■ . } and {di 1 * , • • • . d*, 1 ' ) 

are modified as follows: 


f 2 - 

'll — 


rn r\j = 


i(i) 


.(i) 


i i ~ 


r 2 _d (0,r d (0> 

'll a i a i • 

r r V - 0 

r ll r lj _ d l J = 2, • • • ,p, 

d (o) _ hihj d ( 0> 9 ... 

J '2 1 ’ J “i 'Pi 

1 11 

rnhi • o 
r H r n - J = 2, •••,/>• 

r ii 


(3.22) 

(3.23) 

(3.24) 

(3.25) 


36 



What we need now is to express rj^ in (3.25) in terms of dj 1 * in (3.24). 
With (3.22) and (3.23) substituting in (3.25), it can be shown that 


with 


Step 2. 


r" = i-drd' 0 ’ = f, T d< 01 , j= 2 ,...,p, 


1 


f, = — d<°>. 

r ll 


z (1) = 


tfdP ffd^ ... ffd^ 


r 22 r 23 

r 33 


dl 1 ) di 1 * 


• r 2 P 
• r 3p 

r PP 

■ dW 


(3.26) 


(3.27) 


-U { r-22, • • • . f~2 P } 


(3.28) 


f, T d™ ■ 

•• W 

rg ■ 

. . r ( 2 ) 
1 2 p 

^33 

r 3 p 


r pp 

d (2) • 

a 3 

• • d< 2 > 

P 


= Z (2) 
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{ r 22 , • • ‘ , f 2p } are computed and {rj? , • • • , } .{rg* , • • • , r< 2) } and 

{d 2 5 ' ’ * i d p 2 ^ } are modified as follows: 


f 2 
7 22 

= 

r\, - di" r dy’ + d< ,|r f 1 f,rd<*> 



= 

ri ! -d< l|T (/-f,fi r )di , >, 

(3.29) 

^22^2 j 

= 

r 22 r 2j - d^ 1)T dJ 1) 

+ d' ,| 'f l f/d;" 



= 

r 22 r 2j - d 2 1)r (/ - 

-f 1 f, T )d' 1) 1 j=3, ..., P , 

(3.30) 

d f 


,(i) r 22 r 2 j (j) 

U J ? V2 °2 

f 2 *> 

, j = 3, ■ • • , p, 

(3.31) 

4? 

= 

„(1) ? 22»*2j Jl) 

r U j,2 '12 - 

7 22 

= f, r (dS" ™'d<») 

r 22 



= 


■iP. 

(3.32) 

r (2) 

r 2; 

= 

r 22 r 2j 

r 2j , 2 r 22 , 

r 22 

j = 3, • • • , p. 

(3.33) 


Again, by using (3.29, 3.30, 3.31), r 2 2 ^ in (3.33) can be expressed in terms 
of dj 2 ' in (3.31), namely, 

(2) _ 1 r f 22 r 2i 2 

2j _ r Jr 22 r 2j ^ r 22 ] 

= ^{»-22% + d< 1)r (/-f 1 f 1 r )d< 1) 

+ d"’V - f.f/'Jdi"]} 

r 22 

= — [di I ' r (/-r,^)][d<"-i^iid« 21 ] 

r 22 r 22 

= — 4" r (/-f,fi-)d« 11 

r 22 

= fa’dj l \ j = 3, (3.34) 

with 

f 2 = — (I-fjfidP. (3.35) 
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Step 3. 



f, r a', 31 • 

f 2 T d< 3) ... ffd«' 
f 3 T d' 3 » .. f’-dj, 3 ' 

^ 44 * f 4p 

T PP 

di 3) ••• d< 3 > 

{ r 33 ,•••, r 3p } are computed and {r ( , 3) r\ 3 J } ,{r 2 ^ ,•••. r^J } 

>{ r 3? i ■ • • i r 3p' } and (d*, 3 * , ■ ■ • , dj, 3 ^ } are updated as follows: 

*33 = ^-dfdf+d^f.ffdf+dff^df 

= r 33 - d^V - - f 2 f 2 r )d< 2 \ (3.37) 

^33^3 j = r 33 r 3j — di 2|r d« + d^f.^df 



39 



+4 2,r f 2 f 2 r d; 21 

= r»r 3> - d^ T (I - f,f, r - f s £f) d ; 21 (3.3S) 

d< 3 > - H< 2 ) f 3j ,( 2 ) 

J ~ dj _ ’ ; = 4, '"’ p ’ (3.39) 

4? = tW - ^d< 2 >) 

r 33 

= f, r df . j = 4, • • • , p, (3.40) 

4? = f/(df - ^d< 2 ») 

r 33 

_ 4»7j(3) 

- h d j 1 J = 4,...,p, (3.41) 

_<3) - „ f 3i 

r 3i - r 3j - — r 33 , J = 4, • • • , p. (3.42) 

Now, rg* in (3.42) can be written as 


4? = ^’'(/-f.^-f^df (3.43) 

= f 3 Td j 3) i j = 4, •••,/>, ( 3 . 44 ) 

where 

f 3 = - f.ff - f 2 f 2 r )df. (3.45) 

Proceeding in this way, it can be shown that 

Step i 


-9 2 

r = r z 

it 9 


-d| f,_ 1 f?: 1 )d|*- 1) , 


wu = r ‘« r o - d S , " 1)r (/ - fif x r f,_ 1 f^ 1 )d;- i) , 


(0 _ j(i-l) 
'j ~ 

_ IiidS‘- |) 

r» 

? 

.(»') _ fT j(i) 

fcj ~ h d > 


+ 

* 

ii 

III 

1 

1 

C^-i 

i 

- ft-.fi.jd!'- 


(3.46) 

(3.47) 

(3.48) 

(3.49) 

(3.50) 
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By further defining two new quantities, 


r iifi € Tv", i 

II 

►— * 

(3.51) 

1 

1 

1 

(N . 

II 

h— 1 

(3.52) 

of formula can 

be easily derived from (3.51) and 

(3.50), 

7 

'xT 

7 

6~ 


(3.53) 

Gi-i - tiff = 

T 

r' &*&* -rinxn 

i 2 € /C , 

7\, 

(3.54) 


hence (3.46) and (3.47) become 

-2 2 

r„ = r„ - d) g, , 
r»rij = r tt r t j - d*' -1)r g, . 

A new MGSPO algorithm with rank-n downdating is thus given below. 
Algorithm 5 New MGSPO rank-n downdating algorithm 
[njtializgium : 

= d, , i = 1,. . • ,p. 

Go = K- 

Recn rzinrr 

for i = 1, • • • , p, do 

g. = G,-! dS‘ _1) ; 

r dt l) = rl-dt l)T Si ; 

Gi = Gi-i - ; 

' »» 

for j = i + 1, • ■ . ,p, do 


(3.55) 

(3.56) 


T'ii 1 ij ^tt j 


-d!- 11 ' G,_, dj 


(- 1 ) 
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= r .. P .._ d </-i>v . 

' *•' IJ U J gl ) 

d {') = d (-l> _ ruji d (,-l) 

J J r * . i » 


end; 


end. 


It can be shown that this algorithm requires 

V p 

^{n 2 + n + (n 2 + n) + [ (« + «)]} = np(p + 2n + 1) (3.57) 

,=1 ;=i+i 

flops. If p » n, this new method needs about 0(np 2 ) flops, while the unimproved 
MGSPO method needs about 0(np~ + p 3 / 3) flops. Therefore, asymptotically 
when 2p 2 + 3p > 12n 2 + Qn , it is more efficient to use this new method. As for 
HGR [2], it can be shown that it needs 

n • Hr 2 + hp - *)] 

t=i 

multiply-and-add, np square-root and 2 np division operations, or, 0(np 2 + 1.5np) 
flops. HHT [48] requires 

Yl[( n + 3) + 2 (n + l)(p - i)] 

t=i 

multiply-and-add, p square-root and {p 2 - p)j 2 division operations, or, 0(np 2 + 
1.5p ) flops. Our newly reformulated MGSPO becomes very attractive especially 
when p is much larger then n among existing methods. 
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3.4 Simultaneously Up/Down-dating Multiple 
Rows 

Similar to the hyperbolic Householder transformations(HHT) proposed by Rader 
and Stemhardt [4S], our MGS pseudo orthogonalization can also perform rank-/t 
updating and rank-f downdating simultaneously. 

Let 



then an algorithm for simultaneously up/down-dating the Cholesky factor R can 
be derived in the similar way. 

3.5 Block Systolic Triarray Using MGSPO 

Fig. 3.1 is a block MGSPO systolic array without square roots. The boundary 
and regular processor elements (PEs) of a block MGSPO systolic array without 
square- roots work as follows: 

Boundary PE: 

= Gk-xd,; 


g. 

Gi 
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Regular PE: 


= rl-dfgi. 


r ii r ij dj g 

J / / 

- d J - r it r ij^JJ 
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Chapter 4 


Block Hyperbolic Householder 
Transformation 


For vector-valued processing, Householder transformation is more desirable as 
compared to the scalar- valued Givens rotation when performing orthogonal trans- 
formations either from the viewpoint of computational flops counts or the accu- 
mulated roundoff errors under finite computations [17, 46, 5S, 63], 

Gentleman and Kung [19] in 1981 proposed a systolic triarrav to perform 
QRD using Givens rotation; later McWhirter [41] in 1983 extended this structure 
by propagating data along the diagonal cells to obtain the most recent residual 
data. Kalson/Yao [31] in 1985 and Ling/Proakis [37] in 1986 derived a similar 
systolic structure using the MGS method. These systolic triarravs required 0(p 2 ) 
processors, which may be objectionable for VLSI design especially when p, the 
order (i.e., the no. of columns) of the LS problem, is very large. Rader [47] in 1988 
proposed a wafer-scale linear systolic array to reduce the number of processors 
down to p/2. All of these works only deal with recursive updating QRD. 
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Now, a class of systolic implementations performing up/down-dating are pro- 
posed. They are all based on block-data processing, hence generalize all previous 
works in this area. 

4.1 Hyperbolic Householder Transformations 

Rader & Steinhardt [48] in 1986 proposed a hyperbolic Householder transfor- 
mation (HHT) to simultaneously perform up/down-dating. We will propose a 
systolic architecture implementing HHT in Sec. 4.2. To begin with, let us define 
a J — hyperbolic Householder matrix Hj as follows 

Hj = J - 2hh r /||h||3 , (4.1) 

where h is a column vector, J is a pseudo identity matrix and < ,■ >j is the J- 
pseudo vector norm as defined in (2.30) and (2.32). We note that Hj is Hermitian 
and J-pseudo orthogonal, namely, 


Hj = Hj 


(4.2) 


and 


Hj J Hj = J . (4.3) 

We can compress all of the J-pseudo energy of a vector a into its j-th entry 

by premultiplying it (performing pseudo orthogonal transformation) by Hj and 
choosing 


h = Ja + au j 


(4.4) 


with 


a = (±Oj/||aj||)||a||j . 


(4.5) 
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Here u, is a unit vector with all zeros except for its j-th entry. Then we have 

Hj a = -au ; . (4.6) 

4.1.1 Hyperbolic Householder transformation with up/ 
down-dating 

An algorithm using HHT to update A = [a 1? . . . , a p ] and downdate B = [b lt . . . , b p ] 
from the Cholesky factor R is presented below. 

Algorithm 6 HHT Up /downdating 

fori = 1 do 

ru = \frf, + a^a, - b^b, ; 

if r a < 0, r« = -r„ 
for j = (« + 1), • • • ,p, do 

= ( r « r <j + afa, - bfb j)/ru ; 



if r a < 0, a j a ; ; b ; = — b_,y 

end; 

end. 

4.2 Block Hyperbolic Householder Systolic Ar- 
ray 

Because of the nature of the hyperbolic Householder triangularization procedure, 
it can be shown [38] that Q L = H { j )L ■ ■ -H { f )X , where H { J )L <= H 2kx2k is the 


4S 



lower right submatrix of the hyperbolic Householder reflection 


out the j th column of appended data 


X + y + 
X- y- 


matrix in zeroing 


, with [X + y+] and [X~ y"] 


representing the new and old data block to be up/downdated respectively. The 
residual vector e therefore can be written as e = - H J 1)A ■ . - which can be 


computed by a series of backward matrix- vector mu!tiplications[38]. This systolic 
structure will be considered below. We note that If the block size k is equal to 1, 

then it reduces to that of Gentleman k Kung’s [19] and McWhirter’s [41] Givens 
algorithm. 


A block HHT systolic array for RLS filtering is given in Fig. 4.1. A modifi- 
cation suggested by Tsao in 1975 can be used to slightly reduce the flop counts 
and roundoff errors as well as make a two-level pipelined implementation become 
feasible [38]. Fig. 4.2 depicts the modified boundary and regular processors. 
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(x + ,x~) 
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r 1 
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y + 

y" 


- 1. • u 


unit delay 


Figure 4.1: Block hyperbolic Householder systolic array 
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Figure 4.2: Modified processing cells for Block HHT systolic array 
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Chapter 5 

Planar and Hyperbolic 
Rotations 

A dual-state systolic structure is proposed to perform joint up/down-dating op- 
erations encountered in windowed recursive least-squares (RLS) estimation prob- 
lems. It is based on successively performing Givens rotations for updating and 
hyperbolic rotations for downdating. Due to data independency, a series of 
Givens and hyperbolic rotations can be interleaved and parallel processing can 
be achieved by alternatively performing updating and downdating both in time 
and space. This flip-flop nature of up/down-dating characterizes the feature of 
the dual-state systolic triarray. To further reduce the complexity and increase 
the throughput rate, Cordic cells can be used to mimic the operations of row- 
broadcasting where only one sign bit is propagated along each row of processors. 
Efficient implementation on the evaluation of optimal residuals and a transforma- 
tion of the hyperbolic rotation to an algebraically equivalent orthogonal operation 
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to provide a more stable implementation are also considered. This systolic ar- 
chitecture is promising for the VLSI implementation of fixed size sliding-window 
recursive least-squares estimations. 


5.1 Introduction 

Consider a least-squares (LS) problem at time n given by, 

A'(n)u;(n) ss y(n), (5.1) 

where A'(n) is an i x p fixed-windowed data matrix with real-valued elements 
taken either from a single or p time-indexed multichannel data sequences. 


€ 3?' Xp , (5.2) 



£n-*+l t l 

1,2 

’ ' ' €+l,p 


x «-*+l 

IT 

ll 

X n-^+2,l 

X n-^+2, 2 

x n-M-2,p 

— 

2 


*n, 1 

«n, 2 

*^n t p 


x r 


and y(n) is the real-valued t x 1 desired response vector, 


y{n) = 


Vn -(+ 1 
Un -(+2 

Vn 


€ 


(5.3) 


We denote i as the window size, p as the order of the system (possibly the number 
of sensors in a multichannel filtering problem) and n is the time index(n > i is 
assumed). The LS problem is to find apxl optimum coefficient vector w(n) 6 9? p , 
such that the Euclidean norm of its associated residual 


e(n) = X '(n)w(n) — y(n) 


(5.4) 
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is minimized. If X(n) has full column rank, then it is well known that from the 
normal equation (NE) approach w(n) is given by 

w(n) = (A T (n)X(n))~ 1 X T (n)y(n). ( 5 . 5 ) 

The increased dynamic range requirement (due to the squaring of the condition 
number [ 21 ]) precludes the NE method for some critical applications in modern 
digital signal processing. Therefore, in order to achieve the same computational 
precision, direct matrix factorization methods employing orthogonalization to 
preserve the condition number (such as the QR decomposition (QRD)), are pre- 
ferred especially when it is likely that the numerical instability may arise due to 
ill-conditioning. Furthermore, the NE method of (5.5) is limited to block opera- 
tions, while the QRD method is amenable to recursive operations implementable 
with various parallel and systolic architectures. 

Recently, some efficient up/ downdating algorithms have been proposed [ 2 , 48, 
8 ]. But work on efficient implementations and architectures for a fixed-windowed 
RLS filtering with such up/downdating is still fragmentary. In this paper, we 
propose two systolic arrays[33, 44, 34], which are suitable for VLSI designs, to 
perform fixed- windowed RLS estimation. The first one is denoted as the dual- 
state systolic triarray, which resembles Gentleman and Kung’s triarray [19] with 
the same hardware complexity, except the clock rate of the processor is set two 
times higher. The second one is realized by using Cordic cells to reduce the hard- 
ware complexity. Efficient schemes to obtain optimal residual have not been fully 
addressed for the windowed RLS estimation. Along this direction, we consider 
the feasibilities and limitations based on systolic implementations. A transfor- 
mation of the hyberbolic rotation to a more stable orthogonal operation is also 
considered in this chapter. 
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In Section 5.2, the basic up/downdating RLS estimation is considered, fol- 
lowed by the dual-state systolic architecture in Section 5.3, and Cordic processor 
implementations in Section 5.4. In Section 5.5, we consider the recursive estima- 
tion of optimal residual with systolic implementation. Finally, in Section 5.6, a 
transformation of the hyperbolic rotation to a more stable orthogonal operation 
is derived. Conclusions are then given in Section 5.7. 


5.2 Windowed RLS Estimation 


Suppose at time n, the QRD of [A» i t/(n)], where X(n) and y(n) are given by 
(5.2) and (5.3) respectively, is available. Then 


(?(n)[A'(n):y(n)] 


R(n) : tt(n) 

0 : v(n) 


(5.6) 


where Q{n) € R e * ( is orthogonal and R(n ) € & pxp is upper triangular. Thus the 
optimum w(n) is given [21] by 


R(n)w{n) = u(n). ( 5 . 7 ) 

R(n) is called the Cholesky factor of A' r (n)A» in that R T (n)R(n) = A' r (n)A'(n). 
The Cholesky factor can be obtained by computing the p x p sample covariance 
matrix X T (n)X(n) first, followed by Cholesky decomposition. But, this method 
squares the condition number in forming the covariance matrix. A numerically 
more stable approach is to perform QRD directly on the data matrix X(n) and 
in this way the condition number of the LS problem is maintained. 

Now at time n + 1, we want to obtain R(n +1), u(n + 1) and hence w(n 4-1) 
with the minimum effort. If the window size is growing, then we can simply 
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update R(n) by p Givens orthogonal transformations to zero out x£ +1 and obtain 
R(n + 1) [41]. But with a fixed sliding window scheme, in addition to zeroing 
out the new data row x£ +1 by orthogonalization, it is still necessary to downdate 
the obsolete data row, x£_ <+1 . We define updating as a series of Givens rotation 
operations such that an additive rank-one modification of the Cholesky factor 
is accomplished, and downdating as hyperbolic rotation operations such that a 
subtractive rank-one modification is made. It is noticed that at time n + 1, the 
data matrix 


X(n 4* 1) = 


V r 

2 


n 

T 

n+1 


(5.8) 


is obtained by adding a new row data x£ +1 and removing an old row data x£_ <+1 
from A'(n) as can be seen by comparing (5.2) with (5.8). Since R T (n + l)R(n + 1) 
= X T (n -(- l)AT(n -f 1), we have 


R T (n + l)R( n + 1) = R T (n)R(rt) + x n+l x£ +1 - x n _* +1 x£_ m . (5.9) 

Rader and Steinhardt [4S] proposed a hyperbolic Householder transforma- 
tion to update multiple new data rows and downdate multiple undesired ones 
simultaneously. Alexander et al. [2] suggested performing orthogonal rotations 
for updating followed by hyperbolic rotations for downdating. It can be shown 
that hyperbolic rotation is merely a degenerate case of hyperbolic Householder 
transformation, if we do not distinguish a rotation matrix from a reflection ma- 
trix [21], This is analogous to a Givens rotation can be considered as a special 
case of a Householder transformation. To facilitate systolic array processing, we 
will adopt the latter approach for windowed RLS filtering which involves only 
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scalar computations. 


5.2.1 Up/ down-dating Cholesky factor 

The basic up/downdating of the Cholesky factor is considered in this section. 
Given [fl(n):u(n)J, we can obtain [i?(n + l)iu(n + 1)] by first updating 

R(n) : u(n) 

*n+l : 2/n+l (5.10) 

X n— <?4-l : J/n-f+1 

via p Givens rotations, z.e., 


R(n) : u(n) 

G P,P+i--G 2 , p+l G hp+l x £ +1 i yn+1 

X n-<+l ’ 

R{n + 1 ) : u(n + 1 ) 

0 ! ui(n + l) , (5.11) 

T 

X n-f+l : !/n-(+l 

then downdating the right-hand-side via p hyperbolic rotations, i.e.. 



R(n + 1) 

Hp,p+2 ' ’ ' ^2,p+2^fl,p+2 0 

v r 

A n-<+l 

R(n + 1 ) : u( n + 1 ) 

0 : ^(n + 1) • 

0 : u 2 («+l) 



u(n + 1) 
vi(n -f 1) 

J/n-f+1 

(5.12) 


57 



ere a (p + -) x (p -f 2) Givens rotation matrix G liP+1 is used to zero out the 
{p+ 1, *)-th element of the matrix in (5.10), he., 


G{. 


p+i 


: 


I 


• 


• 

a, 

— 

Ci Si 

I 


Ct, 

— 

\j(X 2 i + Q&.1 

a P +i 


-Si Ci 


a p+ 1 


0 

. 


1 


* 


* 


(5.13) 


where 


c< - or,-/ y/ar? + a 2 p+l and s; = a p+ i/y/o^ + a^. (5.14) 

Similarly, a (p + 2) x (p + 2) hyperbolic rotation matrix H ilP+2 is used to zero out 
the (p + 2, i)-th element of the matrix in (5.11), 


Hi, 


P+2 


a. 


. <*P+2 _ 



Cf s t 

I 

-Si Ci 


a. 


^P+2 


\/ a l ~ a 


2 

P+2 


(5.15) 


where 


~ a *7 \J a2 i ~ «p+ 2 and Si = a p+2 /^/a? - «2 +2 . (5.16) 

Since Gi.p+i only affects the i-th and (p + l)-th rows of the matrix in (5.10), 
and H ,, p+ 2 affects the i-th and (p + 2)-th rows, we can combine (5.11) and (5.12) 
in the following manner. 


H. 


P.P+2 ' ' ■ H2,p+2H\,p+2G p'p+i • • • Gl, 


p+1 


R(n) 


'■n+l 


L <-<+1 


: u(n) 

: J/n+l 

: Vn-e+l 
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(5.17) 


R{n -f 1 ) : u(n + 1) 

= 0 : Vl ( n + 1) . 

0 : v 2 (n + 1) 

5.3 Dual-State Systolic Triarray 

Similar to the systolic QRD triarray proposed by Gentleman and Rung [19], which 
only performs updating, a dual-state systolic triarray performing both updating 
and downdating is given in Fig. 5.1. In a multi-channel filtering problem, for 
every sensor (i.e., column of the data matrix) there is a delay buffer of window size 
i to queue up the data. Therefore each data will be first fetched and processed 
(updated) and then stays in the queuing buffer for l data clocks and finally will be 
reprocessed (downdated) by the triarray. Before the skewed data rows enter the 
arrays, there is an array of selection switches that alternatively take in new data 
and old data. The clock rate for the processors is set at twice the input data rate 
so that both new and old data can be processed within one data clock. We use a 
black circle • to denote a processor working on a Givens rotation( updating) and 
a white circle o to denote a hyperbolic rotation(downdating). We also note that 
only one control bit is required in determining whether updating or downdating 
operation needs to be performed. 

To this dual-state systolic triarray, data rows are skewed with updating and 
downdating data interleaved to form a sequence of up/down-dating wavefronts 
which will then impact upon this triarray sequentially. All of the wavefronts are 
consistent, i.e., the involved processors will all perform updating or downdating 
according to the underlying wavefront. As one updating wavefront finds its way 
along the triarray, one downdating wavefront follows immediately behind, and 
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then followed by another updating wavefront, and so forth. 

Every processor, after experiencing one updating wavefront, will switch from 
updating to downdating operation as the next downdating wavefront will pass 
through it immediately following the previous updating wavefront. Therefore, all 
processors perform updating and downdating successively. Thus they are doing 
flip-flops in lime, which characterizes the temporal duality of this systolic triarray. 

A spatial duality can be also observed as follows. While a processor is per- 
forming updating, all its adjacent processors, either vertical or horizontal , but 
not diagonal neighbors, are performing downdating. As an example, consider the 
(2,3) processor in Fig. 5.1. When this internal cell is performing updating, its 
right neighbor, the (2, 4) internal cell, and its lower neighbor, the (3, 3) boundary 
cell, are being impacted by the downdating wavefront just before the updating 
wavefront that impacts upon the (2,3) internal cell (recall that up/downdating 
wavefronts occur consecutively). Similarly, its left neighbor, the (2, 2) boundary 
cell, and its upper neighbor, the (1,3) internal cell, must be performing down- 
dating, too, as these two neighbor cells are confronting the downdating wavefront 
which follows right after the updating wavefront associated with this (2,3) cell. 
We therefore say that this triarray also performs flip-flops in space. 

In all, for each time snapshot, we see all processors are doing updating and 
downdating evenly distributed over the entire triarray, and for the next snapshot, 
they change their roles. The phenomenon of flip-flops both in time and space 
characterizes the dual-state systolic triarrays. The wavefronts for the updating 

and downdating then propagate pairwise toward the lower-right direction in the 
triarray. 
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5.4 Cordic Processors 


Cordic (coordinate rotation digital computation) processors [62, 1, 14, 47] have 
been shown to be able to efficiently perform Givens and hyperbolic rotations 
with simple operations like add, subtract and shift, and one fixed-number 
multiplication. 


5.4.1 Givens rotations 

First consider the determination of the rotational angle # , such that a vector 
[a,6] r is rotated into [j^W + 6 2 ,0] r , he., 


iJlVo* + 6* 


cos # 

sin# 


a 

0 


— sin# 

cos # 


b 


(5.18) 

We can split 0 approximately into N predetermined minirotational angles with 
the proper choice of the directions of these angles, such that each minirotation 
only involves additions and binary shifts. To see this, a recurrence of minirota- 
tions can be written as 


a t'+l 


cos#, sin#, 


a i 

^i+1 


— sin#, cos#, 


. V 




1 tan#,- 1 


= cos #, 


= cos #, 


— tan #, 1 

a, -f pi> ~'bi 

Pi— <a i 4* b{ 


a { 

bi 


,i = 0,1,* • • , iV - 1, 


(5.19) 


where 


eto 

b 0 


a 

b 


(5.20) 
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and 


tan0; = /5,2 

The planar sign bit p, is determined by 


Pi - < 


1 , if ciibi > 0, 

[ — 1 , otherwise, 


(5.21) 


(5.22) 


and the intentional choice of the minirotaion angle 0, in (5.21) renders the 
shift-by-i bits (multiplied by 2~') operations. 

If the number of minirotation stages N is large enough, it can be shown [62] 
that 


a/v 

= (ffcos*,.) 

/ AT-1 

n 

i 

tan 0, 

) 

a 

b n 

V isO / 

\ ,=o 

— tan 0i 

1 

J 

b _ 


0 


(5.23) 

(5.24) 


— n? = o l cos#, 0.60725 is called a planar rotation correction factor and is 
usually independent of N when N is large enough. The rotational angle is thus 
uniquely determined by the planar sign bits p{ s, 

yv-i 

0 ~ XI Pi tan ' 1 2 ~'- (5.25) 

i=0 

In our updating scheme, it is necessary to apply the same rotation to all the 
subsequent data on the two involved data rows (i.e, with one being in the triarray 
and the other the new data row being updated). In fact, it is not necessary to wait 
until all the minirotation planar sign bits pi s are generated from the boundary 
cell. In order to take advantage of the fact that all the subsequent data on these 
two rows of data are to be rotated in the same manner as that in the boundary 
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cell, we can pipeline these minirotaion angles to the internal cells as soon as 
they become available. Therefore every time a planar sign bit is generated by the 
boundary cell, it can propagate to the rest of its right-hand-side internal cells such 
that the others can start doing minirotations as soon as possible. Thus along the 
horizontal direction, the clock rate of the mini-clock of the Cordic cells is iV + 1 
times the rate of the vertical direction, which is set equal to twice the incoming 
data rate. Because the systolic miniclock rate along the horizontal data rows is 
much faster than the incoming data rate, we can consider the Givens rotation as 
almost simultaneously applied to every data on these data rows, or, the rotational 
angles are being broadcast along the remaining internal cells in the same row. 


5.4.2 Hyperbolic rotations 


For the same reason as above, a sequence of mini-hyperbolic rotations can be 
found to accomplish a hyperbolic rotation as follows: 


hW - V 


cosh 0 — sinh <j> 


a 

0 


— sinhfl!> coshd> 


b 


Now, a recurrence of mini-hyperbolic rotations are given as 


a i+i 
6i+ 1 


cosh <j>i — sinh <f>i 


a, 

sinh 4>i cosh <?>,■ 


bi 


cosh 4>i 


cosh 4> t 


1 — tanh d>i 

— tanh <pi 1 

a, - er t 2~' _l 6; 

1 

-cr.-2 -,-1 a,- 4- 6, 



i = 


- 1 , 


(5.26) 


(5.27) 
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where 


and 



tanhd>,- = cr,'2 _,_1 , 

and the hyperbolic sign bit cr, is determined by 


(5.28) 


(5.29) 


cr, = i 


1 , if a,bi > 0, 

— 1 , otherwise. 

If N is large enough, it can be shown [1] that 


V 




= [ JJ cosh 4>i 


(=0 


( N - 1 

n 


1=0 


1 — tanh 4>i 

— tanh </>,• 1 


\ 

/L 


0 


(5.30) 


(5.31) 


(5.32) 


We call fCh = UtLo 1 coshei, 1.2051 the hyperbolic rotation correction factor. 


5.4,3 Cordic cells 

The Cordic implementation of the dual-state systolic array has the same general 
triarray structure as that in Fig. 5.1. Since we have split a rotation into N 
minirotations and one correction factor multiplication, the time needed to per- 
form a basic processor operation in Fig. 5.1 will be divided into jV+ 1 miniclocks, 
too. We notice that rotating two rows of data needs not take place sequentially 
from one column to another, which is the case considered in Fig. 5*1. In fact we 
can broadcast the rotational parameters ((c, s) or (c,S) from the boundary cell 
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to all of its right-hand-side internal cells such that their rotations can be com- 
pleted simultaneously. Unfortunately, in implementing VLSI circuits, the routing 
difficulty incurred due to non-local connections prohibits such a broadcast opera- 
tion. This is the reason why a systolic array processing algorithm is much favored 
from the VLSI viewpoints in addition to its massive parallelism and regularity. 
However, by using Cordic cells, we are able to mimic the row broadcasting of 
rotational parameters with only local connections. 

Cordic rotations distinguish themselves by performing minirotations sequen- 
tially. The minirotational parameters are carried solely in a stream of sign bits. 
These minirotational bits can be sequentially passed along without waiting for 
the availability of the entire bit stream. Therefore, the right-hand-side internal 
cells can start doing minirotations as soon as these sign bits are available. The 
stream of sign bits are propagated horizontally along the right-hand-side internal 
cells. By doing this, new data are skewed with only one miniclock in between, 
instead of one processor clock. We also observe that the Cordic implementation 
reduces the wavefronts of skewed data from an tilting slope of 1 to l/(N -f 1). 
If A is sufficiently large, we can say the data is essentially not skewed and a 
rotation is taking place simultaneously on each row of the triarray. 

Fig. 5.3 (a) and (b) depict the boundary and internal Cordic processing cells, 
while Fig. 5.3 (c) and (d) describe the associated sign bit generator and shift reg- 
ister operation. To differentiate between the following updating and downdating 
operations, all the downdating parameters are enclosed in the parentheses. We 
also use a solid arrow J, to represent a data movement at a system clock rate and 
a dashed arrow > at a miniclock rate in these figures. Along the horizon- 
tal direction, instead of passing the rotational parameters c,s(c,S) at a system 
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clock rate (which is the same as the clock rate along the vertical direction), the 

mmirotational sign bits p,’s (<7,’s) move towards the right at a faster miniclock 
rate. 

The boundary cell of Fig. 5.3 (a) is responsible for determining the sign bits 
Pi s (cr.’s). It has an internal memory to store the diagonal element r in the 
upper triangular matrix. At the first miniclock, i = 0 ( i denotes the miniclock 
index of a complete rotational cycle), it fetches data y from above. During the 
following miniclocks 0 < i < jV, r and y are cyclically fedback to the minirotator 
to successively generate the minirotation sign bit of pi(< 7 <) and propagate it to 
the right-hand-side internal cell. In the last minirotation stage of i = 1 V, the 
internal data r is multiplied by a correction factor K c (K , h ) and restored to its 
local memory. This completes a rotational cycle of the boundary Cordic cell. 

As to the internal Cordic cell in Fig. 5.3 (b), it also takes in external data 
from above in the first miniclock, then successively feedbacks data and rotates 
according to incoming sign bits. In the meantime, these sign bits are also propa- 
gated to the right. In the last miniclock, both data r and y on two feedback arms 
are multiplied by the correction factors, with one restored to its local memory 
and the other output downward for further processing. 

Both boundary and internal Cordic cells share many architectural similarities. 
The differences between updating and downdating in Fig. 5.3 (a) and (b) are: 
(1). the correction factors: (2). the shift register indices differ by one; (3). the 
signs at the lower left adder input of the minirotators. 
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5.5 Recursive Estimation of Optimal Residuals 


We have considered the recursive evaluation of coefficient vector u;(n) in Section 
2. In many applications such as beamformation, array processing and filtering, 
and communication, the optimal weight coefficient vector may not be of direct 
interest. Instead, we may be interested in the newest optimal residual e n which 
is the last (i.e., the <?-th) element of e(n) in (5.4). Information is then extracted 
from the optimal residual for various applications. In this section, we consider an 
efficient implementation to obtain the newest residuals under the up/downdating 
operations. 

From (5.6), we can separate Q(n) into two terms as 


Q(n) 


Qi(n) 

Q 2 {n) 


where Q^n) € 3? px ', Q 2 (n) e 3? ( '- p)x ', such that 


(5.33) 


Qi(n)X(n) = R(n), 


Qi(n)X(n) = 0. 

Also from (5.4), (5.6), and (5.17), we can rewrite the optimal residual vector as 


e(n + 1) = — Q 2 (n + 1) 


vi(n + 1 ) 

i >2 (n + 1) 

Thus, a basic issue is the efficient recursive evaluation of Q 2 (n + 1). Define 


(5.34) 


<2i(rc) 


Q{n+ 1) = 


(5.35) 
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then 


R(n ) : u(n) 

Q( n + l)[-^(n 4- 1) : y{n + 1)] = x£ +1 : y n+1 . (5.36) 

. X n— i+l : Vn-l+1 

From (5.17) and discussions in Section 2, denote 

H(n + 1 ) = H PiP+ 2 • • • #2,p+2#l,p+2, 

G(n + 1) = G PiPi . 2 • • • G ! 2 ,p + 2 (j r i i p + 2 - 

Then we have 

+ 1) = H{n + 1 )G(n + l)Q(n + 1) (5.37) 

if updating is performed first and 

Q(n + 1) = G(n + l)tf(n + l)Q(n + 1) (5.38) 

if downdating is performed first. Suppose updating is performed first, then we 
have 

<?(« + 1) = H{n+ l)<2 u (n+ 1), (5.39) 

where Q u {n + 1) = G(n + l)Q(n + 1) is defined as the Q matrix associated with 
updating only. It can be shown that G is of the form 

Z(n + 1) h(n + 1) 0 

G(n + 1) = F(n+ 1) UUi Ci 0 , (5.40) 

0 0 1 

where Z(n + 1) is a p x p matrix, and therefore Q u is of the form 

Z(" + l)Qi(n) h(n + 1 ) l' 

Q u (n + 1) = F(n + l)Q x (n) riLi c, 0 • (5.41) 

0 0 1 
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It can also be shown that H is of the form 

Z(n + 1) 0 h(n + 1) 

H(n + 1) = 0 10, (5.42) 

k T (n + 1) o nr =l c, 

and therefore Q(n + 1) is of the form 

Z(n + l)Z(n + l)<5i(n) Z(n + l)h(n + 1) h(n+l) 

Q( n + 1) = k T {n+l)Q x {n) IlLi c, 0 .(5.43) 

k T (n + l)Z(n+ l)Qx(n) Jc T (n + l)/i(n + 1) fl f = i c, 

From (5.34) and (5.41), we can obtain the residual vector when the updating 
wavefront passes through the array and is given by 

e u (n + l) -Qf(n)^(n + l)u 1 (n+ 1) 

e u (n + l)= e u ,(n + l) = — FlLi c « ' M n + 1) , (5.44) 

e U2 (n + 1) — U2(^d“l) 

where e Ul and e U2 are the newest residuals associated with the updating and 
downdating respectively at time n + 1. Since at this point the downdating has 
not yet been performed, e U2 (n -f 1) is not considered as a residual. 

From (5.34) and (5.43), we can obtain the residual vector when the downdat- 
ing wavefront passes through the triarray. Again, we are only interested in the 
newest residuals (the last two elements) and are given by 

e i( n + 1 ) _ — n?»i Ci • t>i(n + 1) — h T (n + l)k(n + l)t/ 2 (rc 4- 1) 

- e ^ n + 1 )J 1 -nf=,Q-t- 2 (n+l) 

(5.45) 

where e x and e 2 are the residuals associated with updating and downdating re- 
spectively. From (5.17), it can be seen that vi(n + 1) and u 2 (n + 1) can be 
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obtained naturally from the up/downdating operations in the triarray. If the 
updating parameters c,’s are propagated down to the diagonal boundary cells 
and are cumulatively multiplied as in [41], when the updating wavefront passes 
through the triarray, the term flc, in (5.44) can be obtained. A multiplier call is 
then used to obtain e Ul (n 4- 1) = - nf +1 c, • u x (n + 1) as in [41]. In fact, although 
the window size is l as described in (5.2), the residual e u ,(n + 1) is estimated 
from [x n _( +1 , • • • , x„, x n+1 ] T and [y n _/ +1 , • • • , y n , y n+1 ] T of window size ^4-1 since 
downdating of x n _; +1 has not yet been performed. That is 

e uii n + 1) = X n+1 ^’[n— (+1.JI+1] — Z/n+1? (5.46) 

where th[n_/+i,n+i] denotes the optimal coefficient vector estimated from data 
[x„_, +1 , • • • , x„, x n+1 ] T and [?/ n _/ + i ,■••,?/„, y n +i] r . 

Also, when the downdating wavefront passes through the triarray, if c/s are 
propagated down to the diagonal boundary cells and are cumulatively multiplied. 
From (5.45), the downdating residual can be obtained easily. It is estimated from 
[x n _ <+ 2 , • • • , x n , x n +i] T of window size £. That is, 

e 2 (n + 1) = x£_, +1 uj [n _, + 2 , n+1] - y„_j +1 . (5.47) 

Obviously, the residual at time n - / + 1 is post estimated by data from n- 1 + 2 
to n + 1 and appears at time n + 1. This kind of property may or may not be of 
practical interest in real-life applications. As to the updating residual e x (n + 1), 
due to the term h^(n+ l)£(n + 1 )u 2 ( n 4* 1) which is not available from the systolic 
implementation, we are unable to extract e x (n 4- 1) from the triarray. However, 
(5.45) provides a simple relation for this updating residual before and after the 
downdating. That is, 

e i( n + 1) = e Ul (n 4- 1) - h T (n + l)£(n 4- l)u 2 (n 4- 1). (5.48) 
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If downdating is performed first, then by the same argument as above, we can 
obtain 

p 

e d2 {n + 1) = - JJ Ci ■ v 2 (n + 1) = xJL /+1 t2i [n _ J+2iB] - y n _ l+1 , (5.49) 

1=1 

and 

p 

ei (n + 1) = - n c « ' Mn + 1) = x^ +1 ih( n _ i+2>n+1] - y n+1 . (5.50) 

i=l 

From (5.4) and (5.S), it is obvious that this e x (n + 1) is the exact residual we 
are looking for. However, a drawback for this scheme is that downdating first 
before the updating may incur a numerical stability problem [2]. A dual-state 

up/downdating systolic array for the recursive residual estimation is also shown 
in Fig. 5.1. 


^•6 Tradeoffs Between Numerical Stability and 
Hardware Complexity 

From the numerical stability point of view, the usage of hyperbolic rotation for 
downdating may be objectionable, even though it has been shown to be forward 
(weakly) stable in [3]. One of the reasons is that, from Fig. 5.2, c and J generated 
by the boundary cell can be very large. Once these c and s are sent to the internal 
cells for further processing, the computations may involve two other parameters 
r and z which are not scaled (he., they can also be very large). Therefore, z' and 
the updated r may suffer large amount of roundoff errors, or even overflows. To 
stabilize this problem, we consider an algebraically equivalent set of orthogonal 
parameters to replace c and 3. Along this line, let us first consider the relation 
between updating and downdating. 


71 



Suppose we have a known p x p upper triangular matrix R and want to 
downdate a vector x to obtain an upper triangular matrix R. That is, 

R t R=R t R — xx t . (5.51) 

If we know R instead of i?, then 


R t R + xx r = R T R 


(5.52) 


becomes an updating problem. 

To downdate i?, we use a sequence of hyperbolic rotations to zero out x as 


R 

0 


= H. 


p.p+i 


• ■ H2,p+iHi iP+l 


R 


(5.53) 


On the other hand, to update R , we use a sequence of Givens rotations to zero 
out x as 


(jp.p+i ' ' ' Gj.p+iGi.p+1 


R 


R 

. xr . 


0 


(5.54) 





R fc ~ 1 


R 



i,p+i • • • Gh,p+! 

. xT . 

— 





1 

ic, 

H 

r-4 

1 

H 

o 

o 
1 


Suppose now, for this updating problem, we know R instead of R, and A: — 1 
updating has been done. That is 


(5.55) 


where R k ~ 1 denotes the first k — 1 rows of /?, R p -k+i denotes the last p — k + 1 
rows of R , and denotes an element of the k - th updated vector x. At the A-th 

rotation, let us focus only on the fc-th and the last rows, then we have 

' ' ' 5 Cfc.fc) ‘ ‘ ‘ i 

0, 


Ck 

Sk 


. ~ sk 

c k _ 



0 x (k ~ 1] 

, u, x k , 


X (k-X) 

p 
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°» • 1 0, r k>k • • • 

0 • • • n n id*) 

U ’ ? u, u x t+1 , 


‘ ‘ * r k, f 


where r itj and r t>i are the (i,j) elements of R and R respectively, 


and for j = k + 1 , 


r k,k = V Hie + x k l) \ 


— _ x i k 

C k — , $k — 

r k,k r ktk 


r k,j = c k r k ,j + s k xy 

J k ) _ 


—Skh,j + C k x) 


(5.58) 


From (5.57) and (5.58), we can solve r and ad** easily and the downdating can 
now be achieved by using Givens rotation parameters given by 


for j = k+ 


r k,k x k 


r k,k x[ k 

Ck = — ,s k = -= . 

r k,k r k ' k 


f k,j = ( r kJ - s k x {k 1 } )/c k , 
x f^ = —Skf k ,] 4- c k xf~ l) . 


(5.59) 


(5.60) 


Since c k and s k are bounded parameters (he., < 1), even though r ktJ can be very 
large when computing x ^ \ it is only multiplied with s k which is bounded. That 
is to say, r k ] will never be magnified during the computation. Thus, this scheme 
is more stable than hyperbolic rotation. The operations of the processing cells 
for the systolic implementation are shown in Fig. 5.4. 

With this transformation, the operations of the downdating part are differ- 
ent from that of the updating part. However, both provide stable numerical 
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results especially under finite precision computations. If we want to make both 


operations identical (except fro sign differences) for implementational efficiency. 
Transformations on c and s of Fig. 5.2 can be performed such that new c and s 


are given by 

.1 .3 

C = -, 3 = -. 

c C 

Therefore we have 


(5.61) 


r = ( r + sy)/c , 

V' = -sr + cy, (5.62) 

which share the same form as that of the corresponding downdating operation. 
However, it loses the numerical property of an orthogonal transformation and 
hence is less desirable under finite precision computations. The operations of 
this updating transformation are shown in Fig. 5.4. 


5.7 Conclusions 

A dual-state systolic triarray performing up/ down-dating operations for fixed- 
window RLS filtering has been proposed. While previous researches have been 
centered on QRD-based systolic triarray with exponentially forgetting factors to 
perform updating, no suitable VLSI architecture (such as that based on systolic 
array), has been proposed to perform fixed-window RLS filtering. 

Due to the inherent similarity between updating and downdating, they can 
use the same hardware and alternatively pipelined to achieve parallelism in this 
dual-state systolic triarray. A flip-flop systolic behavior of this array is observed 
both in temporal and spatial domains. We also proposed Cordic cells to mimic 
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broadcasting along the rows in the triarray and thus eliminated the propaga- 
tion of word-level rotational parameters along the rows by using minirotational 
sign bit streams. The hardware complexity of the Cordic cells is quite simple 
and the involved computations comprise of simple arithmetics. Square root and 
division operations are not required. We have also considered some basic sys- 
tolic implementational issues related to the solution of the optimal residuals. To 
remedy potential round-off errors associated with downdating, a modified trans- 
formation has been considered. The issues of numerical stability and pipelined 
computations in the stabilized transformation have also been addressed. 

We have investigated efficient algorithms and architectures for performing 
fixed-window RLS filtering problems that extended previously known results from 
updating to up/down-dating operations. The proposed new dual-state systolic 
triarray architectures appear promising for VLSI implementation. 
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Figure 5.2: Boundary and internal cells. 
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Fig. 4 (b). Internal Cordic processor 
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Figure 5.4: Transformed boundary and internal cells. 
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Chapter 6 

Unified Square-Root-Free 
Rank-1 Up/Down-dating 


Planar (Givens) and hyperbolic rotations are the most commonly used methods 
in performing QRD up/downdating. But the generic formula for these rotations 
require explicit square- root (sqrt) computations, which are quite undesirable from 
the practical VLSI circuit design point of view. Since the sqrt operation takes 
up much area and its computational time is long (due to many iterations), the 
associated area/time efficiency is poor. 

By setting the block size k equal to 1, we can reduce all the vector operations 
down to scalar operations. It can be easily seen that all currently available scalar- 
type algorithms are merely special cases of the block-type algorithms proposed 
in the previous chapters. When k = 1, all the 3 major QRD methods, namely, 
HT, MGS, and Givens, can have square-root-free operations and share many 
similarities. However, if the block size k is not equal to 1, it’s very difficult to 
give a square-root-free version for the HT method, while MGS still enjoys such a 
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fast version as can be seen in Chapter 3. 

After introducing planar (Givens) and hyperbolic rotations, a generalized ap- 
proach to eliminate square-root operations is derived and then it becomes clear, 
that all 2 x 2 orthogonal and pseudo-orthogonal HT, MGS, or Givens rotation 
fall into this generalization. 


6*1 Planar and Hyperbolic Rotations 


A 2 x 2 planar (Givens) rotation matrix is the most fundamental orthogonal 

matrix in performing QR decomposition. A planar (Givens) rotation is given by 
\ c si 


, and is used to premultiply a two-row matrix 


— s c 


<*i «2 • • • ct p 
Jl 02 ■■■ 0p _ 

to introduce a zero element in the (2, 1) location such that it becomes 

Q i <>2 ••• £*p 

.0 0' 2 • • • P' p J ’ 

where 


c = + Pi , 

S = 01 / \f a i + 01 , 

a[ = y^i + 0i , 

a' = cocj + s0j , 

0j = -satj + c0j , 


(6.1) 

(6.2) 

(6.3) 

(6.4) 

j = 2, ••• , p . 

(6.5) 


S3 



Similarly, a hyperbolic rotation matrix is given by 
responding parameters satisfy: 

C = a l/\/ Q l ~ Pi . 


c —s 
—s c 


5 - 01 , 
Q i = PI , 

ca i > 

Pj = -SCtj + C0j , 


jf = 2, • • • , p . 


, and the cor- 


( 6 . 6 ) 

(6.7) 

( 6 . 8 ) 
(6.9) 

( 6 . 10 ) 


6.2 Prototypes of Generalized SQRT-Free Al- 
gorithms 

In VLSI circuit design, square-root operation is expensive, because it takes up 
much area or is slow (due to many iterations). Therefore, it is advantageous to 
avoid square- root operations or minimize the required number of such operations. 
We will focus on how to meet this goal for the 2x2 planar and hyperbolic 
rotations. 

By taking out a scaling factor from each row, the two rows under consideration 
before and after the orthogonal transformations are given by 


and 


i 

a 2 • 

•• a p 


v^T 

0 


a l 

a 2 

a p 

Ji 

@2 ’ 

•• Pp. 


0 

_ 


. *1 

b 2 

••• b p 


a 2 ' 1 



'yfi 

0 


a \ 

a 2 

••• 

0 

Pi ■■ 

• 


0 

•Jk . 


0 

b'i 

... 


(6.11) 


( 6 . 12 ) 
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Now, our task is to find the expressions for k [ , k 2 ,a\, {(a', fe'), j = 2, • • • , p}, 
in terms of k 1 ,k 2 ,{{aj,b j ), j = l,-*-,p}, such that NO square-root operation 
is actually needed. The square-root expressions of \fk[, and yjk[ in 

(6.11) and (6.12) are used for representational purposes only and are not actually 
performed. 

For simplicity, let us focus on planar rotations only, similar derivations for the 
square- root-free hyperbolic rotation can also be obtained by replacing k 2 with —k 2 
and k' 2 with -k’ 2 . Replacing a, = fa7~a : , fa = y^, a' = y£[a', fa = y7“6', 
j = 1, . . . ,p, in (6.1) - (6.5) leads to 


c = j , 

s = + ^-’ 2^1 ^ 

= \/& i^i + ^’2^1 i 

= 


aj 


<2; 4* 


*2&1 


\Ai a i + ^2 H \/k\0. i + Ar 2 6t 

y^6' = 




a + y/^ 1^2 a \ ^ 

\/M? + Mi J ^Mi + k 2 b\ 3 ’ 
After simplifications we have 


j = 2 , 


a i = 


a i = 


*5 = 


M? + Jfc 2 &2 

N *5 

i 

+ m? 

\ZJc\Jc2 


[kia xdj + k 2 bibj] , 


.7=2, 


>P- 


[— ijOj + ai^j] , 


(6.13) 

(6.14) 

(6.15) 

(6.16) 

(6.17) 

(6.18) 

(6.19) 

( 6 . 20 ) 


y^ + £261 

To avoid square-roots, we need to determine k[ and k' 2 such that a[, a' and 
b'- will not require square-root operations. Let us express k{ and k 2 as 

kia\ + k 2 b\ 


K = 




( 6 . 21 ) 


85 



k' — 


JC\ fc 2 


( 6 . 22 ) 


v*{k x a\ + k 2 b\) ’ 

where ^ and v will be determined later to be any square-root-free functions 
of * 1 ,^ 2 , ai, and fq. Indeed, with (6.21) - (6.22), (6.18) - (6.20) can be com- 
puted with no square- roots, and we have the following updating formulas without 
square-roots: 


K 


+ k 2 b\ 

/* 2 

k\k 2 

u 2 {k A a\ + fc 2 6?) “ /i 2 i/ 2 £{ ’ 

Z 2 , 


Ka\ + + 

k\Ci\CLj -f- k 2 b\bj 

55 ’ 


i^[— 6iOj -t- ct x 6j] , 

a i fk i 

JWi' 

h. [El 

fi\j k[ ' 


(6.23) 

(6.24) 

(6.25) 

(6.26) 
(6.27) 
(6.2S) 

(6.29) 


Notice that square-root operations disappear in our formulas of (6.23) - (6.27) 
needed in the planar and hyperbolic rotations. The use of the rotation param- 
eter c in (6.28) (with a square- root operation) will be further considered in the 
next section when the optimum residual e is desired. Later work will show that 
it is possible to obtain e without any square-root operation where the explicit 
computation of the rotation parameter c can be bypassed. To avoid repetitive 
computations and take the advantage of previous computed results, (6.24), (6.26), 
(6.28) and (6.29) use the newly updated k[ of (6.23). As stated earlier, we are 
still free to choose those two parameters \i and u. Different choices of fx and v 
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will affect the number of multiplications and divisions, as well as the numerical 
stability and parallelism of computations. 

It can be shown that this newly derived approach generalizes all of the previ- 
ously known researches on the square-root-free algorithms via an proper choice of 
H and v. Among them are Gentleman(’73) [20], Hammarling(’74) [24], Bareiss(’S2) 
[4], Kalson and Yao(’85) [31], Ling, e£c.(’86) [37], Barlow and Ispen(’S7) [5], Chen 
and Yao(’88) [13], Gotze and Schwiegelshohn(’89) [23]. Table 6.1 lists various 
square-root-free algorithms and the corresponding choices of p and u. 


6.3 SQRT-Free Triangular Array Updating and 
Optimum Residual Acquisition 

In this section, we will apply the prototypes of sqrt-free rotations developed 
above to QRD-based RLS filtering problems. To be specific, we are interested in 
updating from 

T R u 1 

(6.30) 


R 

u 

x T 

!J 

R' 

u' 

0 T 

V 


to 

\ R' u' 1 

(6.31) 

as given earlier in (2.12). It can be shown [41] that the p x p upper triangular 
matrix R ' can be obtained through a sequence of p Givens rotations and the 
optimum residual e for the newly appended data [ x r : y ] is given by 


e = - (]I c «) y ’ 

i=i 

with Ci representing the cosine value of the i-th rotation angles. 


(6.32) 
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Table 6.1: Choices of /x and v for various sqrt-free Givens rotations 
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author( year) comment 

1 

i 

Gentleman( ’73) 

Oj = 1 

k\a\+k2b\ 

k 2h 

-i 

h 

71 


k \ cl ^ -f- kob^ 
k\a\ 

x 

a 1 

71 


k\a^+k 2 $ 

k\a\ 

x 

a l 

Hammarling(’74) 


k\a ^ +^ 2 ^? 
*i«*i 

-1 

Al 



kiay+k^b^ 

k\a\ 

Ml 

k \a^ 

71 


1 

x 

a l 

Bareiss(’S2) 


a x + k 2 b\ 

1 

a 1 -H A: 1 6 ^ 

Ling(’S6),Kalson/Yao(’S5) 

k \ CL i — 1 

k\a\+k 

k\a\ 

x 

a l 

Chen/Yao (’SS) 

A'iai = 1 


1 

Gotze/Schwiegelshohn(’S9) 



2 T3 

Barlo\v/Ispen(’S7) 

Scaled 


SS 











































Factoring out the scaling constants into the pre-multiplying diagonal matrix 
leads (6.30) to the form of 


r Il r l2 

r 22 


• r i P 

• r 2 P 

r pp 


Xi x 2 


«1 
u 2 

U p 

y 


(6.33) 


AT 


A£ 




\Jk _ 


a \\ 12 - ' ‘ «1 p 

a 22 • • • a 2p 


*pp 


&i 6 2 


a i,p+i 

a 2,p+l 

a P.P+l 

Vh 


. (6.34) 


Unlike the previously developed formula, where we are only interested in updating 
to k;, a'ij and zeroing out all the 6,-’ s, this time we do also need to know 
the cosine values explicitly as required in the optimum residual given in (6.32). 
After the first rotation, b\ will be zeroed-out and we have 


d~ kqb^ 

K 1 ~ 5 

Ml 

,(i) _ k±kj 

9 nWiK ’ 


a n — mi 

i 


a ij = 


7 [ + k q bibj ] , 


M\k[ 

7 * = v \ [ —b\a\ j + a\\bj ] , 

an fki 

Mi V k[ ’ 


j — 2, • • • , p + 1 , 


Ci = 


(6.35) 

(6.36) 

(6.37) 

(6.38) 

(6.39) 

(6.40) 
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with (jxi, v x ) being the parameter pair which are still free to be chosen later. Note 
the close analogy of (6.35) - (6.40) to those of (6.23) - (6.28). Similarly, after the 
i-th rotation (1 < i < p) ? we have 


K = 

* *« ' 

A 

(6.41) 


AAH ’ 

(6.42) 

< = 

/*« , 

(6.43) 

< = 


(6.44) 


' * 1% 3= 2, • • • ,p 4- 1 , 


bf = 

u i [ a-ij + a u b^ ^ ] , 

(6.45) 

c, = 

®t‘i I ki 

w V3* 

(6.46) 


After p rotations are finished, (6.34) becomes 

sfc 

s/K 


*11 


*12 


*22 


0 0 


1 ip 

°l,p+l 



l 2p 

®2 } p + 1 

j 

t 

l PP 

S.p+i 

0 

u(p) 

°p+l 


(6.47) 


which has the form of 
now becomes 


R! 

0 r 


e = 


u 


V 


in (6.31). The optimum residual e in (6.32) 





(6.48) 
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To further simplify the expression in (6.48), we notice that k ( q p) can be com- 
puted recursively as follows, 


k\ p) 


\ Wp ) 
( 

\ H v v ) 


k (p-D 

\ ^V-\ v p-\ ) 


k (p- 2 ) 


(6.49) 

(6.50) 


p 

n 

(y/wV 

*+i 

/w ) 


kq 5 


where (6.42) is used in the recursion. 

With (6.51) substituted into (6.48), we have 

' P <*„ k, 


e = — 




(6.51) 


(6.52) 


Because yfh^[bi, 62 , . . . , 6 P , 6 P+1 ] = [i 1 ,x 2 , . . . , x p , y ] is the appended new data row, 
we are certainly free to choose kg = 1 to reduce the arithmetic complexity and 
simplify the expression in (6.52). Therefore, a lemma on obtaining the sqrt-free 
optimum residual is given below. 


Lemma 3 (Sqrt-free optimum residual) 

The optimum residual e can be computed with no square-root operations as fol- 
lows: 




(6.53) 


McWhirter[41] successfully employed Gentleman’s proposition[ 20 ] in comput- 
ing the residual e without sqrt operations. By choosing 


m = = an = 1, 1 < i < p , 


(6.54) 
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the optimum residual can be reduced to 

e — — ^11 ^ 7 ^ &p + 1 (Gentleman/McWhirter). (6.55) 

This result was first observed by McWhirter. 

Another example can be taken from Hammarling’s suggestion[24] as follows, 

*,4 + 

i = 1 , •••,?. 


Mi = 


ki®ii 


"i = 1 / an , 

then it follows that k- = kian/f.n and the optimum residual is given by 


e = 


Q » Mi \ ,(p) 


p + 1 




n 2 

. ,ti Mi^i a.. 

|sr) 

II “fl ^+i (Hammarling). 

. i=l a ii / 


(6.56) 

(6.57) 


(6.58) 

(6.59) 

(6.60) 


6.4 Conclusions 

Planar (Givens) and hyperbolic rotations are the most commonly used methods 
in performing QRD up/downdating. Most of these rotation- based methods re- 
quire explicit square-root computations, which are undesirable from the practical 
VLSI circuit design point of view. Since the square-root operation takes up much 
area and its computational time is slow (due to many iterations), the associated 
area/ time efficiency is poor. This is the first effort to establish the basic un- 
derstanding toward all known square-root-free QRD algorithms, from which the 
basic criterion is seen to be simple. This unified approach also provides a fun- 
damental framework for the square-root-free RLS algorithms which are essential 
for practical VLSI implementations. 
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Chapter 7 


Truncated Least-Squares 
Problems and Applications 


In resolving closely spaced frequencies from limited amount of data samples. 
Tufts and Kumaresan (1982) proposed a SVD-based method to solve the forward- 
backward linear prediction(FBLP) least-squares problem. By imposing an exces- 
sive order in the FBLP model and then truncating small singular values to zero, 
this truncated SVD (TSVD) method yields a very low SNR threshold and greatly 
suppresses spurious frequencies. However, the massive computation required by 
SVD makes it unsuitable for real time super-resolution applications. We propose 
to use truncated QR methods which are amenable to VLSI implementations, such 
as systolic arrays, with slightly degraded performances as compared to the TSVD 
method. Three truncated QR methods for sinusoidal frequency estimation will 
be considered: (1) truncated QR without column pivoting (TQR); (2) truncated 
QR with re-ordered columns (TQRR); and (3) truncated QR with column piv- 
oting (TQRP). It is demonstrated that the benefit of the TSVD method for high 
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frequency resolution is achievable under the truncated QR methods with much 
lower computational cost. Other attractive features of the proposed methods in- 
clude the ease of updating, which is difficult for the SVD method, and numerical 
stability. Thus, the TQR methods offer efficient ways for identifying sinusoids 
closedly clustered in frequencies under stationary and non-stationary conditions. 
Some results based on the truncated normal equation approach as well as on 
sufficient conditions for perfect truncations based on truncated QR and SVD 
methods are considered. Based on the FBLP model, computer simulations and 
comparisons are provided for different truncation methods under various SNR’s. 
Comparisons of asymptotic performance with large data samples are also given. 

7.1 Introduction 

In recent years, there is much interest in seeking efficient and effective algorithms 
for resolving closely spaced sinusoids in the frequency domain as well as in the 
spatial domain [52, 59, 10, 32, 49, 51, 29, 50, 60]. Generally, a “good” algo- 
rithm for spectral estimation should comprise of several factors, such as: high 
frequency resolution capability; computational efficiency; updating and down- 
dating capability; and implementable parallel processing structure so that fast 
real-time applications are possible. Different methods may perform well in some 
aspects but suffer in the others. While the SVD-based method is well known for 
its robustness in resolving closely clustered sinusoids, it is not attractive from the 
other desirable feature points of view. In this paper, we consider several other 
promising approaches based on the truncated QR and least-squares techniques. 

In the pioneering paper of Tufts and Kumaresan [59], a SVD-based method for 
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solving the forward-backward linear prediction(FBLP) least-squares (LS) prob- 
lem was used to resolve the frequencies of closely spaced sinusoids from limited 
amount of data samples. By imposing an excessive order in the FBLP model and 
then truncating small singular values to zero, this truncated SVD method yields 
a low SNR threshold and greatly suppresses spurious frequencies. However, the 
massive computations required by SVD makes it unsuitable for real time super- 
resolution applications. We propose to use truncated QR and LS methods which 
are more amenable to VLSI implementations, such as on systolic arrays [19], with 
insignificantly degraded performances as compared to the TSVD method. Three 
different truncated QR methods will be considered, depending on the ordering of 
the columns of the data matrix. The first one is the truncated QR method without 
column shuffling (TQR). This method does not change the structure of the data 
matrix. A QR decomposition (QRD) of the data matrix is followed by the trun- 
cation of the lower right rank-weakly submatrix of the upper-triangular matrix. 
The second one is the truncated QR method with reordered columns (TQRR). 
The reordering of the columns is determined in an a priori manner [51]. Here 
truncation is performed on the QRD of the column-reordered data matrix. The 
computational cost of this TQRR method is the same as that of the first method, 
except for the column reshuffling. The last one is called truncated QR with col- 
umn pivoting (TQRP) [21]. This method entails a series of dynamic swapping of 
columns while performing QRD. An additional computational cost is required to 
monitor the norms of the remaining columns in the dimension-shrinking subma- 
trix such that the first column is replaced by the one with the largest norm in the 
remaining submatrix. The processing overhead of successive column swapping 
may be nontrivial and prohibitive in implementing a VLSI structure. All these 
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three truncated QR methods only involve a finite number of computations, while 
for the TSVD method, the number of iterations required cannot be specified in 
an exact manner. Based upon MATLAB computations, SVD requires about 5 
to 6 times the number of flops as compared to QRD for a dense 50 x 50 matrix. 
Furthermore, we should note that QRD only requires a small number of flops 
for updating when new data are successively appended, while updating SVD is 
generally much more intractable [11]. A truncated normal equation approach will 
be shown to be equivalent to the TQR method except for the increased roundoff 
errors under finite precision computations. 

A FBLP model for estimating sinusoidal frequencies is formulated first, fol- 
lowed by an introduction of different truncation methods and the minimum-norm 
solutions. Finally, comparisons of these three QR and the LS methods to the 
TSVD method are given based on computer simulations. 


7.2 FBLP Model 

Consider a complex- valued data sequence of length n, 

p 

ii = £ c k e j2wf " + Wi = x,- + Wi, i = 1, 2, • • • , n , (7.1) 

h—\ 

where p is the number of sinusoids , complex- valued c k comprises the amplitudes 
and phases of each sinusoid, and io, is an additive white Gaussian noise. We 
define the signal-to-noise ratio (SNR) as 

SNR(dB) = 201og(||x|| 2 /||u,|| 2 ). (7.2) 
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It can be shown [59] that under noise-free conditions, the frequency locations can 
be obtained by finding the roots of 


S(z) = 1 - £ guz~ k = 0 , (7.3) 

k=l 

on the unit circle, where the complex- valued coefficients g' k s, k = 1,2, •••,£, 
satisfy the following system of FBLP equations 



with £ > p representing the order 


of the prediction model, and * the complex 


conjugate. We will assume that 2 (n — £)>£. For simplicity, denote (7.4) as 


Ag = b , (7.5) 

where the data matrix A and the right-hand-side vector b are constructed from 
the data sequence {x, | i = 1 , . . . , n} in a FBLP manner. Symbolically, this will 
be denoted by [^4:6] = {x, | i = 1 , . . . , n} FBLP . It is noted that the top half of A 
represents the forward prediction and is of Toeplitz form while the bottom half 
represents the backward prediction and is known as a Hankel form. The rank of 
A is p if min{2(n — £),£} > p. When the noise is present, we use an ~ on A and 
b, i.e., A — A + E and b = b+ e, to denote the noise-corrupted FBLP model with 
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the additive noise given by [E : e] = {u>, | i = 1, . . . , n} FBLP . (7.5) now becomes 
the FBLP LS problem of 

Ag^b, (7.6) 

where A usually has full rank due to the perturbation of the noise. One standard 
approach [59] is to use the TSVD method on (7.6) to obtain a rank-p approxima- 
tion of the FBLP matrix A y denoted by A^gVDy followed by solving for a minimum 
norm LS solution of g given by 

A^svd g — b . (7.7) 

Then the frequencies can be computed by finding the phases of the roots of 
(7.3) close to the unit circle or searching for the peaks on the pseudo-spectrum 
l/|5(exp(j27r/)| 2 , —0.5 < / < 0.5. Notice that the proper choice of the pre- 
diction order t depends on p, the number of sinusoids, which may or may not 
be known in advance. Fig. 7.1 depicts a flowchart diagram summarizing the 
estimation of harmonics frequencies based on the FBLP model. 

7.3 Truncation Methods 

In this section, we consider the rank-p approximation of the FBLP matrix A and 
subsequently solve for the minimum-norm solution g ^ . For many LS problems, 
ill-conditioning can be troublesome, and truncation methods are known to be 
useful in stablizing the solutions at the cost of slightly increased residual errors. 
The rationale is that the condition number[21] of a matrix, defined as the ratio of 
the largest to the smallest singular values, can be used to characterize a worst case 
bound on the LS solution when the underlying matrix is subject to some unknown 
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perturbation or round-off errors. When the smaller singular values are discarded, 
their ill effects on the LS solution are reduced(i.e., stabilized), or we may say that 
the new condition number of the truncated system is decreased(i.e., stabilized). 
However, this truncated LS solution, although stablized, will be different from 
the one that gives the minimum residual; instead, its associated residual with 
respect to the original un truncated linear system will be larger. Therefore, the 
tradeoff for the truncated linear system lies between an increased stability versus 
a decreased residual. 

Let 

r Si o i r v h 

A = UXV H = {U X U 2 \ 1 (7.8) 

0 s 2 J [ V» _ 

and 

AU. = QR = [Qj Q 2 ] (7.9) 

0 

be the S VD and QRD of the 2 (n — C) x t complex- valued matrix A respectively, 
where ** denotes the Hermitian of a complex-valued matrix or vector and II is 
a column-permutation matrix and will be explained later. Si = diag(jj i, . . . , a p ) 
and S 2 = diag(ap+i , . . . , cr*) represent nonincreasing singular values. Ru 6 
C p x p , R X 2 G C p x ^ p \ and R 2 2 G while R is an upper-triangular 

matrix. 



U=[U 2 C? 2 ] = 

[fil,. • 

. ,tip, U P+1 ,...,U,] G C 2(n e)xt . 

(7.10) 


V=[Vi V,} = 

[Oi, ... 

, Up, u p+ i, . . . , u<] € C lxt , 

(7.11) 

and 






Q ~ [Qi Qi\ 

= 

••iQpi 9p+ii • • • 1 9 <] G C 2 ^ n ( ' >xt 

(7.12) 

all have orthonormal columns, i.e., 

“““j = = qVqj = S {j . 
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In the absence of noise, £2 = R 22 = 0. Here the permutation matrix II = 
[7^, . . . , 7T/] is used to represent different methods of performing QRD with column 
interchanges. Now we want to preserve as much of the energy as possible (with 
respect to the Frobenius norm defined below) in the trapezoidal matrix [R n R 12 ] 
of (7.9). Equivalently, we want to leave as little as possible the energy residing in 
the lower right submatrix R 22 i which will be truncated. This approach amounts 
to selecting the columns of A in an order such that the column with the largest 
linear independency will be selected first. This procedure is repeated for the 
shrinking submatrix. 

There are at least 3 possible methods for determining the permutation matrix 
II while performing QRD, which are: 

1. For QRD with no pivoting, II is simply an identity matrix. 

2. QRD with pre-ordered columns [51] determines II according to a column 
index maximum-difference bisection rule. Here we select the first and the 

columns, followed by the column halfway between 1 and i. Then 

we pick the columns that lie in the midway of those ones which are already 
selected, i.e., ["(1 + ["^D/'-l 1 f(T^y^)/2] +Q/2], and so on. This selection 
rule does not depend on the real-time data in A. The underlying reason for 
this ad hoc fixed-ordering scheme is to provide the selected columns with a 
possibly maximum differences or minimum linear dependency among these 
columns. This scheme was motivated due to the nature of the matrix A 
arranged in the form of (7.4) consisting of perturbed sums of harmonic 
sinusoids. As an example, suppose there are 5 columns, then the pre- 
ordering strategy leads to [1,5, 3, 2, 4]. Thus we have II = [ej, es, e3, e2, 64], 
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where is a dimension i column vector with all zero components except 
for an one at the i ih position. 

3. As for QRD with column pivoting [21, p. 233], II is determined during 
the QRD process, where i Tj = e ^ and d i € [1,€] is the index such that 
dd t has the largest norm. Continuing with this column-pivoting process 
on the lower right submatrix yet to be triangularized, we can determine 
the permutation matrix II which yields an optimum QRD column ordering 
strategy in the sense of preserving most energy in the upper trapezoidal 
submatrix. However, this II is data-dependent and the extra cost for this 
pivoting may make it less desirable for some applications. 

After forcing those rank-weakly quantities to be zero and preserving the most 
significant p-rank, we can obtain a rank-p approximate of A. These rank-weakly 
quantities are those entries in the factorized matrix that contribute least signif- 
icantly to the matrix, or possess the smallest portion of the energy (square of 
Frobenius norm) of the associated matrix. For TSVD, S 2 is discarded and 

£&vd = UiZxV” . (7.13) 

Similarly, for TQR, the lower-right submatrix R 22 is discarded and 

Ajq R II = Qi[R u R\ 2 }- (7.14) 

To account for the effect due to truncation, we define the fractional truncated 
F-norm as 

^1=1 _ H^W|| f /||A|| f , (7.15) 

where || • \\f is the Frobenius norm given by 

MIIf = S l a '.jl 2 = \J trace {A**A). (7.16) 
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Thus we have 


^flvn = ( 7 . 17 ) 

and 

4qr = i - ( - + liMi V /2 (7i8) 

VII^ii|| 2 f + \\Rn\\ 2 F + WRjiBJ ' 1 } 

While 

® — •^TSVD — -^TQRP < 1 ( 7 - 19 ) 

is valid analytically^ 1], from extensive computations we also observed the rela- 
tionships among truncated QR methods to satisfy 

FtQRP < ?TQRR < ?TQR < 1 . ( 7 . 20 ) 

Therefore from the point of view of preserving the Frobenius norm(square root 
of energy) of a matrix, SVD provides the optimum truncation, with TQRP being 
next, while TQR and TQRR truncate even more. 


^•4 Perturbation of ^Matrix Decomposition and 
Perfect Truncation 


In this section, we examine the effects of the noise matrix E on the decompositions 
^ — A "F E , and associated sufficient conditions for perfect truncations based 
on truncated QR and SVD methods. For simplicity, we only consider the QRD 
method without pivoting. Since the noise- free data matrix A has rank p, we have 


A — [Qi : Q2] 


Rn R\2 

0 0 


( 7 . 21 ) 


— [QiRu ■ Q1.R12] • 


( 7 . 22 ) 
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The noise-perturbed data matrix A can be written 


= A + E 


= [Qi : Q a ) 


= [Qi '■ Q a ] 


= [Q i Q 2 ] 


= [Qi : g 2 ] 


i?ll /?12 


0 0 


+ [Qi '■ Qi] 


R\\ + E“\ R\2 T ET 2 


Qu Q\2 

Q 21 Q 22 

R\i &12 

0 R?2 


R 11 R 12 

0 R?2 


E\i E\2 
&21 ^22 


( 7 . 27 ) 


where 


Q 1 — QiQu + Q2Q211 
Q 2 = Q 1 Q 12 T Q 2 Q 221 


( 7 . 28 ) 

( 7 . 29 ) 


Qn Q \2 R\x R 


Q 21 Q 22 


0 Ro 


= QRD 


Rll + E?, Rl2 + E\2 


, ( 7 . 30 ) 


with QRD{'} denoting the QR decomposition operator applied to the matrix in 
After truncation, we have 


= [Q 1 i Qi] 


R\\ Rr. 

0 0 


( 7 . 31 ) 


— [Q1-R11 • Q\R\ 2 \- 


( 7 . 32 ) 
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Let us define a perfect truncation as the case when 


That is, the original noise-free data matrix A can be fully recovered after trun- 
cating the rank-weakly part of the factorization of the noise-corrupted matrix A. 
A sufficient condition for (7.33) (i.e„ (7.32) to be equal to (7.22) ), is for the noise 
matrix E to satisfy £* = E' n = E‘ n = 0, where 


E 


10. i Os] 


E n EU 

e; . E ; 2 


[0. : Os] 


0 0 

o e; j 


(0 i OsA's] ■ 


(7.34) 


(7.35) 

(7.36) 


(7.36) reveals that the first p columns of A are noise-free and the rest of the 

columns all reside in the orthogonal-complement column space of the data matrix 
A. 


Similarly for SVD, we have 


A 


A + E 


A’2 

^21 ^22 


(£7, i U 2 \ 

Si + £*! 

c* 

°12 


V 


. Si 

cn 

i 


v 2 \ 


[Ux U 2 ] 


Sj 0 
0 0 


V x " 

v» 


+ [U\ : U 2 ] 


(7.37) 


V? 

v« 


(7.38) 


(7.39) 


[U x : U 2 \ 


Uh Un 
U*x Un 


E t 0 
0 E 2 


Vix 


H 


V, 




21 


v* H V* H 
V 12 V 22 


vf 

V? 


(7.40) 
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= [ft I ft] 


Sj 0 

0 t 2 


V? 


(7.41) 


where 


[Ui : # 2 ] = [U\ : C/ 2 ] 


[K = [Vi ; v 2 ] 


u:, ur. 


'xi 

/« 


w. u:. 


12 

r * 

22 


ft. ft 
ft, ft 


(7.42) 


(7.43) 


and 


Un UT-. 


12 


ft. ft, j 


Sj 0 

0 t 2 


y* H \/* h 
Ml '21 

y* H y* H 
V 12 V 22 


= SVD ^ 


Si + £\\ £ 


£* 

C 21 


"12 


"22 


(7.44) 

with SVD{-} denoting the SVD operator applied to the matrix in {•}. 

We also notice that to achieve perfect truncation for the SVD method, we 
need 

Wo ~ tfiSti/" — A . (7-45) 

A sufficient condition for (7.45) is S t \ = S' 2 = S 21 = 0 (i.e., E = U 2 €; 2 V 2 H ) and 
the largest singular value of E is less than the smallest singular value of Si, or 
equivalently, the row and column spaces of E must be orthogonal to those of A. 
and there exists a gap between the singular values of A and E such that even the 
weakest signal subspace will not be corrupted by any erroneous noise space. 


7.5 Minimum-norm Solutions 

After truncation, the FBLP LS problem becomes rank-deficient, hence the minimum- 
norm LS solution is desired in order to suppress those spurious harmonics in the 


105 



pseudo-spectrum. The following lemma gives the minimum-norm solution for a 
rank-deficient LS problem. 


Lemma 4 (Minimum-norm solution) 

For an underdetermined LS problem, 

By = c, (Be C pxt , c eC p , P < t), (7.46) 


with rank(B)=p. The minimum-norm solution y is in the row space of B. 

Proof. S uppose B has full row rank and y belongs to the row space of B, i.e., 
y = B H z, then there exists a unique solution y = B H z since z = ( BB H )~ l c is 
unique. Suppose there are other solutions of the form y = y + , where y^ 

lies m the space perpendicular to the row space of B, i.e., By L = 0. Then it is 
obvious that 

IN 2 = llyll 2 + !|y J -|| 2 > ||y|| 2 . (7.47) 

So 


y = ar 9 min{||y|| | By = c}. 


For simplicity, let IT be an identity in the QR.D case. To avoid the cumbersome 
normal-equation-like computation of the minimum norm solution, B H (BB H )~ l c, 
with B = (i?n ^ 12 ] and c = Q^b, we can firstly perform a backward Q R decompo- 
sition on the conjugate transpose of [i?n f?i 2 ] to obtain the same solution as given 
in the above lemma without fill-in’s(the newly introduced nonzero entities while 
performing QRD). Ill-conditioning will not occur because the diagonal elements 
are sufficiently large in the trapezoidal truncated matrix. By doing backward 
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(7.48) 


modified Gram-Schmidt orthogonalization[21] procedure, we can have 


B h 




R H 

n \2 


= TL, 


( T e C tXp - L € C pXp ), 


with L being lower triangular. Here the columns of T - [t lt . . . , t„] (satisfying 
T H T = I) are computed in a backward manner, i.e, from t p to t lt and the 
diagonal elements of L are computed from the lower right toward the upper left. 
We note that the minimum solution for the truncated QR method is given by 


9 p tqr = B h {BB h )~'c 

= TL{L"T"TL)- 1 c (7.49) 

= T{L~" c), 


where it is also noted that a backward substitution is required in the computation 
of L~ h c. 

Therefore the minimum-norm LS solution can be obtained via the following 
procedure: 


1. Do QRD on the augmented matrix [Alb] (with possibly column pivoting); 

2. Take the transpose of the trapezoidal upper triangular matrix. Do backward 
QRD (save the orthogonal matrix T ) ; 

3. Apply backward substitution on the transpose of lower-triangular matrix 

obtained in step (2) and the updated right-hand-side in step (1), followed 
by (7.49). 

According to this lemma, a minimum-norm solution vector g ^ must lie in 
the row space of the rank-reduced matrix .4^ p) , namely, the row space of V X H or 
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[•flu -fiin]- For TSVD it is given by 


9TSVD = Vi Zf'U“b, 


(7.50) 


or 


a (p) _ y' ~ 
9tsvd — 2^ ~ — u i • 


i-i ^ 


(7.51) 


To obtain g { f QR , we can perform QRD on the right of the trapezoidal upper- 
triangular matrix in (7.14) to zero out R u and also obtain the orthonormal row 
space, T H , of [i? n R x 2 ], That is 


Ajq R n = Qi[R n R 12 ] = Q l L H T H , (7.52) 

where f = [<i,...,t p ] € C txp has orthonormal columns and L H £ C pXj> is an 
upper-triangular matrix. This is sometimes called a complete orthogonal fac- 
torization [21, p. 236], and we can consider it as a two-sided direct unitary 
transformations on a rank-deficient matrix to compress all the energy of a matrix 
into a square upper-triangular matrix. This resembles the SVD method where 
two-sided iterative unitary transformations are applied to reduce a matrix into a 

diagonal matrix. Then from (7.49) and (7.52) the minimum-norm solution follows 
by 


Jp) 

9tqr 


= n 




(RnR^ + R^R^Qfb 


= nfZ- H Q*b . 


(7.53) 

(7.54) 


It is noted that if no truncation is performed at all and A has full column 

rank l, then the LS solution from the FBLP model is either obtained from SVD 
as 


9 


Vi + V 2 ti l U?b 
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(7.55) 


or from QRD as 


~ 9tsvd + 23 — Vj , 


an 


j=j>+i a i 


9 


n 


/in 

R 12 

-1 

t-O 

as- 
'O 
1 

0 

R-22 


. Qfb J 


Rn -RuRnRil 

o r ~ 2 1 


Qfb 

Qfb 


n 


RnQfb-R^R l2 R-^Qf b 

R 22 Qfb 


(7.56) 


Because S 2 and R 22 are both nearly zero under a high SNR condition, slight 
variations on them will cause significant perturbations in the solution vector g 
and hence leads to many spurious frequencies in the pseudo-spectrum. Now it 
becomes clear why one truncates these rank-weakly quantities to remedy these 
ill-conditions from the view point of numerical stability and also prefilters some 
stray noise in an attempt to guard against possible contaminations in the pseudo- 


spectrum. 


For many problems, the conservative approach of over-modeling (i.e., I >> p ) 
is preferred [59, 49] to taking l ~ p, since we can later truncate some noises that 
reside in the null space which is orthogonal to the signal space. The advantage 
of over-modeling is to provide some extra dimensions to trap the stray noises 
and then remove them by truncation. This is an effective way of enhancing the 
SNR. However, there is always the danger that some signal has been mistakably 
truncated in low SNR cases where ambiguous changes in the truncated F-norm 



IS possible. On the other hand, spurious frequencies are still very likely to occur 
when there is insufficient truncation of the rank of the data matrix. 

7.6 Truncated Normal Equation Approach 

Up to now, the matrix decompositions are performed with the direct data, hence 
the condition number of the problem is not increased. It is interesting to see that 
there also exists a truncated normal equation (TNE) solution which computes 
the minimum norm solution for the covariance data. We show that essentially 
this TNE method is mathematically equivalent to the TQR method except for 
increased roundoff errors under finite precision computations. 

An untruncated least-squares solution for (7.6) using the normal equation 
approach is to solve 

A H Ag = A H b. (7.57) 

If we rewrite A as [A, : A 2 ) = [A l \A l P + N], where F € represents the 

projection of A 2 onto A x and N € represents the remaining residual. 

Therefore, columns of A x € and .V € C 2 '"-^'-* are orthogonal to each 

other. Under high SNR cases, N is close to a zero matrix, and in the extreme 

case when the noise is absent, N is equal to zero. Then (7.57) can be rewritten 
as 

A Ai A^ A 2 

a"a x a?a 2 

or 

A? Ax A»A 

. F H A”A l + N H A 2 F h A"A 2 4 


9 = 


Afb 

Afb 


(7.5S) 


- N h A” j 


9 = 


Afb 
L Afb 


(7.59) 
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We can see that as N goes to zero, the bottom ( £-p ) equations in (7.59) become 
redundant, because they are merely equal to F H times the top p equations. 
Therefore, a truncated normal equation (TNE) can be defined by discarding the 
almos t- redundant bottom (£ - p) equations in (7.59). 

Similarly, a minimum norm solution (from the above Lemma) follows by 


>) „ 

9tne — 


A?A i 
A? A\ 


[A x A x a” A x + A* A 2 A% A x )~ x • (A^S). 


(7.60) 


It remains to show that TQR and TNE methods in (7.53) and (7.60) are math- 
ematically equivalent. To this end, we can replace A x by Q x R n and A 2 by 

QiRn +Q2R22, hence we have A? A, = R"R n and A*A X = R^R U . After some 
manipulations, (7.60) can be written as 


3tIie — 




+ jjuSJDa,,]- 1 ■ (Q?b) , (7.61) 


which is equal to g^qj j in (7.53) where IT is an identity. 

We must note that although TQR and TNE methods are mathematically 
equivalent, the former is much favorable under finite precision computation. The 
rationale is that the TQR method always deals with the direct data, while the 
TNE method works on the covariance data where the dynamic range of data is 

inevitably squared. Therefore the TNE method is more susceptible to roundoff 
errors. 


7.7 Simulation Results 

Finally, we present various computer simulations based on the following model. 
Let x, = cos(2Trf x i) + cos(2ir f 2 i) + W{,i = 1,2, •••,48-, with f x = .125, f 2 = 
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.135, C = 36 and {to,} is a white Gaussian random sequence. The frequencies 
are determined by the phases (from 0 to ») of complex roots closest to the unit 
circle. For TQRR, we pre-permute the columns of the FBLP matrix in the order 
of: {1 36 18 9 27 5 ... } as suggested by (51). We will consider three quantities 
on the evaluation of the performances. The first one is the frequency bias, which 
is defined as the difference of the true and estimated frequencies. The standard 
deviation of the estimated frequency is our second performance measure. The 
last one is the distance of the third principal root to the unit circle. Here we 
represent the principal root as those roots that are close to the unit circle. Ideally 
e should have only 2 principal roots (due to the two sinusoids if we only consider 
those roots with phases within [0, »]), Ming exactly on the unit circle, while all 
the others are inside the unit circle. Under moderate SNR conditions, a third 
principal root may be mistaken as a third candidate harmonic, if its distance to 
the unit circle is approximately equal to the first 2 principal roots. An even worse 
condition may occur when the true harmonic falls behind (he., further away from 
the unit circle) an spurious harmonic due to the random noise. This is similar 
to the case that a noise subspace enters the signal subspace. Therefore, a good 
separation of the 3rd harmonic from the unit circle not only decreases the chance 

of mistaking false harmonics but also increases our confidence on estimating the 
true number of harmonics. 

Two classes of comparisons will be considered in the following curves. The 
first one is to compare these truncation methods under various SNR from 0 to 
50 dB. The second is to observe the asymptotic performance by fixing the order 

1 = 36 ’ and increasin S the number of observed data samples. 100 independent 
simulations are used to obtain the statistical means and standard deviations. 
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Fig. 7.2 gives the average fractional truncated Frobenius norms of (7.16) 
versus SNR when we preserve only the four most significant ranks of the FBLP 
matrix for the five different methods. This confirms their relationships in (7.19) 
and (7.20) and also shows that the truncated energy decreases monotonically as 
SNR increases. Fig. 7.3 and 7.4 show the averages of the frequency biases for 
the two harmonic frequencies. We define the average frequency bias as E(f k ) - 
/*, k — 1,2, where E{f k ) is the ensemble average of /*, which is the estimated 
frequency for f k . Fig. 7.5 and 7.6 show the standard deviations of f x and f 2 . 
We can see that TQRP competes quite well with TSVD, while TQRR performs 
slightly worse than TQRP but better than TQR without pivoting. Fig. 7.7 and 
7.8 show the distances to the unit circle of the first 2 dominant roots that are 
closest to the unit circle. Fig. 7.9 gives the distance to the unit circle of the third 
closest root. Since this third root is a false one, it should be far away from the 
unit circle to allow for easy determination of number of harmonics. 

If we fix the SNR= 10 dB and the order of the FBLP model to be l = 36, 
as more data are collected, the ill effect due to noise should be asymptotically 
smoothed out. Fig. 7.10 shows the combined average frequency bias (defined as 
the sum of the absolute values of the biases for f x and f 2 ) verses the number of 
data samples. Fig. 7.11 shows the curves of various combined standard deviations 
of the estimated frequencies which is defined as the square root of the sum of 
squares of the standard deviations of each frequency estimate. Fig. 7.12 depicts 
the mean distances to the unit circle of the false harmonics. From Fig. 7.10 to 
7.12, it is clear that under moderate SNR conditions, the performances of TQRP 
closely follow that of TSVD. 
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7.8 Conclusions 


While a myriad of researches have been focused on S VD and eigen-decomposition 
analysis of narrowly spaced harmonic frequency estimations [59, 49, 29], very few 
have been directed towards the QRD approaches. Owing to the iterative massive 
computations and the difficulty encountered in updating the decompositions [11] 
when new data are acquired under time- varying conditions, these SVD and eigen- 
based approaches are ill suited for real time applications. It is well known [21] 
that QRD is numerically as stable as SVD, requires much less computational cost, 
easy to update(and/or downdate), and amenable to VLSI implementations. The 
slightly degraded performance for these truncated QR methods is greatly com- 
pensated by all the benefits mentioned above. As well known, the performance 
of the LS method is usually much worse than those of the QR and SVD methods. 

Table 1 summarizes the comparisons among different truncation methods. We 
conclude that TQR is the simplest and can be performed easily in a real time 
updating, but may suffer significant degradation. TQRP provides almost the 
same performance as SVD, but is not easy to implement in real time processing 
in that the difficult column reshuffling is required while performing QRD with 
pivoting. TQRR provides a good compromise between the above two and can 
also be implemented for systolic array processing. The LS method is simple to 
implement and update but has a poor frequency estimation capability. 
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Figure 7.1. Block diagram for sinusoidal frequency estimation based on the FBLP 


Table 7.1: Comparisons of truncated least-squares methods. 



Freq. est. 

Comput. cost 

VLSI 

updating J 

TSVD 

TQRP 

TQRR 

TQR 

LS 

excellent 
very good 
good 
fair 
poor 

very high 
medium 
fair 
fair 
low 

complex 

medium 

fair 

fair 

low 

difficult I 
medium 
easy | 
easy 
easy | 
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Figure 7.3: Mean frequency estimates for / x = .125 using a 24 x 36 FBLP matrix. 
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mean estimated f 2 




Figure 7.6: Standard deviations for estimating fi = 
matrix. 


.135 using a 24 x 36 FBLP 



Figure 7.7: Mean distances to unit circle of the roots of the 1st harmonic freq. 
estimator vs. SNR. H 
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Figure 7.8: Mean distances to 
estimator vs. SNR. 


unit circle of the roots of the 2nd harmonic freq. 



Figure 7.9: Mean distances to unit circle of the 
freq. estimator vs. SNR 


roots of the 3rd (false) harmonic 
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Sro!!iL® : M f an J freq bi “ vs - no - of dat ^ samples for / = {.125, .135}, 
SNR— lOdB, and order=36. 



Figure 7.11: Standard deviations of estimates vs. no. of data samples for f 
{.125, .135}, SNR=10dB, and order=36. 
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Figure 7.12: Mean distances to unit circle of the roots of the 3rd 
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Chapter 8 


Conclusions 


QRD-based methods for RLS estimations are well known to be useful and effective 
in adaptive signal processing. A time- recursive formulation of RLS filtering using 
block QRD is given for efficient up/down-dating operations. To monitoring the 
whole residual vector a recursive formula is derived. Based on this formula, a 
back substitution for obtaining the optimum weight vector can be bypassed. 

Existing methods of performing RLS estimation either employ exponentially 
weighted or fixed-window schemes under non-stationary conditions. Both are less 
capable in rejecting the ill effects due to temporary noise spikes. A residual-based 
selective window for robust RLS estimation is proposed. Computer simulations 
show that its abilities in tracking and reducing the bias of parameter estimation 
due to spurious noise spikes are superior to those of the existing methods. This 
new method only requires one back substitution to estimate the weight vector, 
while a conventional method requires two such operations. 

Various up/downdating algorithms and systolic architectures are proposed 


122 



for block-type RLS estimations, which include Gram-Schmidt pseudo orthogo- 
nalization methods, and hyperbolic Householder transformations. Conventional 
Gram-Schmidt methods (modified or not) are reformulated such that systolic 
processing is possible and the computational flops are also greatly reduced. 

When the block-size is reduced to be one, all vector processing becomes scalar 
processing. A dual-state systolic structure using planar (Givens) and hyperbolic 
rotations is proposed to perform joint up/down-dating operations. To further 
reduce the complexity and increase the throughput rate, square- root- free Cordic 

cells are proposed to mimic the broadcasting along the rows in the systolic triar- 
ray. 

Most of the rotation-based methods require explicit square-root computa- 
tions, which are undesirable from the practical VLSI circuit design point of view. 
A unified square-root-free rank-1 up/downdating approach is proposed. This is 
the first effort to establish the basic understanding toward more than ten known 
square- root- free QRD algorithms, from which the basic criterion is seen to be 
simple. This unified approach also provides a fundamental framework for the 

square-root-free RLS algorithms which are essential for practical VLSI imple- 
mentations. 

Three truncated QR methods are proposed to estimate closed spaced frequen- 
cies from limited amount of data samples. They include truncated QR methods 
without column pivoting, with column pivoting, and with re-ordered columns. 

Detailed comparisons of these methods to truncated SVD method have been 
given. 

Table 8.1 summaries the author’s contributions and previous authors’ works. 
Matlab programs have been written for all of the algorithms addressed in this 


123 



Table 8.1: List of previously known and new results 


L Item Previous known results New Results 

Res. monit. 

Rader( ! S6):two passes 

one pass 

Syst. RLS Filt. 
w / Forget. Fac. 

Gentleman/ Kung( ’81 ): 
Scalar-valued Givens 

Block- type, 
Householder, MGS 

Syst. RLS Res. 
Upd. w/ Forg. Fac. 

McWhirter(’83): 
Scalar-type Givens 

Block- type, 
Householder, MGS 

Block RLS Filt. 
w/ Fixed- Win. 

Rader/S teinhardt(’S6): 
Hyper. Householder 

Hyperbolic MGS 
w/ out sqrts 

Scalar RLS Filt. 
w/ Fixed-Win. 

Hansen/Lawson(’74), 
Alexander, etc.(’S7), 
Bojanczryk,etc.(’87): 
Hyper. Rot. 

Dual-State Syst. 
Implement. 
CORDIC cells 

Modified GS 

Rice(’66): Load Imbal. 


Sqrt-Free 

Many 

Generalized Appr. 

Spec. Est. 

Tufts/Kumaresan(’S2): 
Truncated SVD, 
Comput. Extensive, 
Diff. to Update 

Truncated QR 
Less comput. 1 

Easy to Update 


thesis. Future work of interest includes : numerical comparisons of various square- 
root-free fast Givens algorithms, applications of rank- revealing QR (RRQR)[12] 
to FBLP-based spectral estimations. 
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